Join Classifier of Type and Index Mutation on Lung Cancer DNA Using Sequential Labeling Model

The sequential labeling model is commonly used for time series or sequence data, where the label of each instance is predicted using the labels of previous instances. In this work, a sequential labeling model is proposed as a new approach to detect the type and index of mutations simultaneously, using DNA sequences from lung cancer study cases. The methods used are One-Dimensional Convolutional Neural Network (1D-CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Bidirectional Gated Recurrent Unit (Bi-GRU). Each nucleotide in the patient’s DNA sequence is classified as either normal or as carrying a certain type of mutation, in which case its mutation index is also predicted. The mutation types detected are substitution, insertion, deletion, and delins (deletion-insertion). In the experiments conducted on the <italic>EGFR</italic> gene, BiLSTM and Bi-GRU displayed better performance and were more stable than 1D-CNN. Further tests were carried out on the <italic>TP53</italic>, <italic>KRAS</italic>, <italic>CTNNB1</italic>, <italic>SMARCA4</italic>, <italic>CDKN2A</italic>, <italic>PTPRD</italic>, <italic>BRAF</italic>, <italic>ERBB2</italic>, and <italic>PTPRT</italic> genes. The proposed model reports F1-scores of 0.9596 and 0.9612 using Bi-GRU and BiLSTM, respectively. Based on these results, the model can detect the type and index of mutations in a DNA sequence more accurately and faster, without the need for other supporting data and tools and without re-alignment to reference sequences. Users therefore only need to enter the DNA sequence to detect the type and index of mutations.

in the Deep Learning approach are Convolutional Neural Network (CNN) [7], Bidirectional Long Short-Term Memory (BiLSTM) [8], Bidirectional Gated Recurrent Unit (Bi-GRU) [9], or a combination of these methods [10]. CNN, BiLSTM, and Bi-GRU have been widely used in the medical field, especially on DNA sequence data, and have performed fairly well in tasks including cancer prediction on gene expression data [11], variant calling in single-molecule sequencing [12], and DNA binding site prediction [13], [14]. However, the approach used in these studies classifies a whole data sequence into a single class, and no existing study uses a sequential labeling model on DNA sequence data.
Deoxyribonucleic acid (DNA) is a genetic code composed of adenine (A), cytosine (C), thymine (T), and guanine (G) [15], which instructs the functions of growth, metabolism, reproduction, and others in the body of living things. Each gene in DNA has a specific function, so mutations in certain genes cause certain diseases; for example, mutations in the EGFR gene are common in lung cancer cases [16]-[18]. In bioinformatics, the type and index of a mutation are generally detected using an alignment approach [19]-[22]. The alignment technique requires reference sequences to predict the mutations in the patient's DNA sequence, and the prediction process takes a long time. Several studies have proposed machine learning-based mutation detection systems [23]-[25]. The problem in these studies is that the model either detects only the type of mutation without its index, or still requires additional tools or data besides the patient's DNA sequence, so mutation detection is constrained when only the patient's DNA sequence is available.
Based on these problems, this study proposes a new approach to detect the type and index of mutations, namely a join classifier for type and index mutation detection using a sequential labeling model. The methods used are 1D-CNN, BiLSTM, and Bi-GRU, which are compared to obtain the best detection system. The types of mutations detected include Single Nucleotide Variant (SNV)/substitution, insertion, deletion, and delins (deletion-insertion), while the index of the mutation is the point where the mutation occurs in the DNA sequence. Substitutions are nucleotide changes that occur at a certain point without changing the length of the sequence, insertions are additions of nucleotides to the DNA sequence, deletions are removals of nucleotides from the DNA sequence, and in delins an insertion and a deletion occur simultaneously at a certain point. Insertions, deletions, and delins change the length of the DNA sequence. The sequences of the ten genes with the most mutations in lung cancer, namely EGFR, TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT, together with their mutations from the public database Catalogue of Somatic Mutations in Cancer (COSMIC) [26], are tested in this study.
This study contributes a sequential labelling model with simple BiLSTM and Bi-GRU architectures that is effective in detecting four mutation types (SNV, insertion, deletion, and delins) and their mutation indexes in the DNA sequences of ten lung cancer genes. BiLSTM and Bi-GRU models with simple architectures potentially have faster training and testing times than models with larger architectures, and the proposed model can detect the type and index of mutations in an average of 0.0105 seconds for a single sequence. The sequential labelling model scans each nucleotide in a DNA sequence, classifies whether a mutation occurs there, and consequently identifies its mutation index. The model is capable of detecting several mutation types and indexes at once from one DNA sequence. This is different from the usual classification model, which only assigns one whole sequence to a certain label without detecting which nucleotides are mutated.
Furthermore, the proposed method can later be used to count the mutations that occur in one sequence, which can be used to determine the mutation rate in certain diseases. Index mutation detection is also useful for identifying new mutations in cancer, other diseases, or new virus variants by comparing mutation indexes between patients. The sequential labelling concept works in the same way as the alignment technique, which checks each nucleotide in a sequence, but the proposed model has a much faster detection time and does not need a reference sequence. The proposed model also requires only the DNA sequence to be analyzed, without other data or tools, to detect the type and index of mutations. This greatly simplifies mutation detection because the user only needs to enter the DNA sequence.

II. MATERIAL AND METHODS
The proposed sequential labelling model for simultaneously detecting the type and index of genetic mutations in DNA sequence data uses 1D-CNN, BiLSTM, and Bi-GRU models. DNA sequences of the ten genes with the most mutations in lung cancer, namely EGFR, TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT, are selected in this study to test the efficacy of the model. This section presents the detailed steps, including preprocessing, division of the data into training, validation, and testing sets, and the design and implementation of sequential labelling models using 1D-CNN, BiLSTM, and Bi-GRU to detect the type and index of mutations. Fig. 1 presents the workflow diagram of the proposed pipeline and the method for detecting the type and index of mutations in DNA sequences.

A. ACQUISITION AND PREPROCESSING DATA
The data used in this study are DNA sequence data from the EGFR, TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT genes, which have been reported to display mutations in lung cancer cases [27]-[33]. The data were obtained from the public database COSMIC (Catalogue of Somatic Mutations in Cancer) [26]. Each gene has several reference gene transcripts with different gene lengths, where the gene length is the number of nucleotides in one gene sequence. The acquired data consist of two parts, namely reference DNA sequence data and mutation target data. The DNA sequence data consist of the nucleotides A, C, T, and G. The mutation target data (mutation calls) contain gene names, sequence transcripts, patient sample ID, AA mutation (protein mutation), CDS mutation (type and index of DNA mutation), primary tissue, and others.
There are several types of mutations, namely substitution (SNV), insertion, deletion, delins (deletion-insertion), and duplicates. Duplicate mutations are combined with insertion mutations because both add nucleotides at a certain index. A mutation record whose type or index is unknown is deleted from the dataset. The patient sequence data are generated by mapping the mutations in the mutation target file onto the corresponding reference sequences, based on the unique patient sample ID and gene transcript. The preprocessed patient sequence data are stored in a CSV file.
DNA sequence data in the form of strings (nucleotides A, C, T, and G) must be converted into a numerical representation because the proposed model requires numeric values as input. The DNA mapping techniques used in this study are integer mapping and Voss mapping. Sequence data were converted to an integer representation using Equation 1 [34], with 0 used as sequence padding to equalize the length of the DNA sequences. Voss mapping is then used to convert the integer sequences into a one-hot representation in the embedding layer using Equations 2-5 [35], [36].
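Equations 1-5 are not reproduced in this text. As a rough sketch only, an integer mapping followed by Voss indicator sequences usually takes the form below. The specific integer codes (1 to 4), the channel order, and the symbol \hat{X} for the integer sequence are assumptions made for illustration; only the use of 0 for padding is stated in the text.

```latex
% Assumed integer mapping (the codes 1..4 are hypothetical; 0 = padding)
\begin{equation}
\hat{X}[i] =
\begin{cases}
1, & X[i] = \mathrm{A} \\
2, & X[i] = \mathrm{C} \\
3, & X[i] = \mathrm{G} \\
4, & X[i] = \mathrm{T} \\
0, & \text{padding}
\end{cases}
\end{equation}
% Voss mapping: one binary indicator sequence per nucleotide
\begin{align}
X_1[i] &= \begin{cases} 1, & X[i] = \mathrm{A} \\ 0, & \text{otherwise} \end{cases} &
X_2[i] &= \begin{cases} 1, & X[i] = \mathrm{C} \\ 0, & \text{otherwise} \end{cases} \\
X_3[i] &= \begin{cases} 1, & X[i] = \mathrm{G} \\ 0, & \text{otherwise} \end{cases} &
X_4[i] &= \begin{cases} 1, & X[i] = \mathrm{T} \\ 0, & \text{otherwise} \end{cases}
\end{align}
```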
with X = input DNA sequence, X = integer sequence, X1, X2, X3, X4 = Voss mapping results, i = nucleotide index, A = adenine, T = thymine, G = guanine, and C = cytosine.
The proposed model is a sequential labelling model, a type of model widely used in Natural Language Processing. In this model, each nucleotide has one label, so the mutation target from the mutation call file needs to be converted to a numeric sequence of the same size as the input sequence. In this study, SNV/substitution mutations were converted to the value "1", insertion and duplicate mutations to "2", deletion mutations to "3", and delins mutations to "4" (Equation 6). The value "0" is also used as padding if the target sequence is shorter than the longest target sequence. For example, for the sequence snippet "ATGGCCATCC" with insertions at the nucleotides with indexes 8 and 9 and a substitution at the nucleotide with index 10, the numerical input sequence and numerical target sequence are as shown in Table 1.
with Y = mutation type, Y = target sequence, and i = index.
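The encoding of the "ATGGCCATCC" example can be sketched in a few lines of Python. This is only an illustration: the integer codes in `NUC_TO_INT`, the 1-based indexing, and the helper names are assumptions, not the authors' implementation; the label values 0-4 follow the text.

```python
# Assumed integer codes; the paper's Equation 1 may assign different values.
NUC_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 is reserved for padding
# Label values as described in the text (0 = normal, also used as padding).
LABELS = {"normal": 0, "snv": 1, "insertion": 2, "deletion": 3, "delins": 4}


def encode_sequence(seq, max_len):
    """Integer-map a DNA string and right-pad it with 0 up to max_len."""
    codes = [NUC_TO_INT[n] for n in seq.upper()]
    return codes + [0] * (max_len - len(codes))


def voss_one_hot(codes):
    """Turn integer codes into 4-channel indicator vectors (padding -> all zeros)."""
    return [[1 if c == k else 0 for k in (1, 2, 3, 4)] for c in codes]


def encode_target(length, mutations, max_len):
    """Build a per-nucleotide label list from (1-based index, type) pairs."""
    target = [0] * max_len
    for index, kind in mutations:
        if not 1 <= index <= length:
            raise ValueError(f"index {index} outside sequence of length {length}")
        target[index - 1] = LABELS[kind]
    return target


seq = "ATGGCCATCC"
x = encode_sequence(seq, max_len=12)
y = encode_target(len(seq), [(8, "insertion"), (9, "insertion"), (10, "snv")], max_len=12)
print(x)  # [1, 4, 3, 3, 2, 2, 1, 4, 2, 2, 0, 0]
print(y)  # [0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 0, 0]
```

Because padding and normal nucleotides share the value 0 in the target, a padded position is indistinguishable from a normal one in the labels; this mirrors the description in the text.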
The numerical sequences are then reshaped into shorter sub-sequences using a sliding window approach with two schemes, namely "with overlap" and "without overlap." Both schemes use window sizes (sub-sequence lengths) of 50, 100, and 150. In the scheme with overlap, the sliding window shifts with stride sizes of 25 and 50, while in the scheme without overlap, the sliding window shifts by the window size so that there is no overlap between sub-sequences. The reshape process is carried out on both the numeric input sequence and the numeric target sequence. An example of the sequence reshape process using a sliding window is presented in Fig. 2.
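The two reshape schemes can be sketched as a single helper. The function name, the `pad` argument, and the choice to right-pad a short final window with 0 are assumptions for illustration; the text does not state how a tail shorter than the window is handled.

```python
def sliding_windows(values, window, stride=None, pad=0):
    """Split a list into fixed-length windows.

    stride=None reproduces the 'without overlap' scheme (the window shifts
    by its own size). A short final window is right-padded with `pad`,
    which is an assumption of this sketch.
    """
    step = window if stride is None else stride
    if window <= 0 or step <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        chunk = values[start:start + window]
        windows.append(chunk + [pad] * (window - len(chunk)))
        if start + window >= len(values):
            break  # the current window already reaches the end of the sequence
        start += step
    return windows


sequence = list(range(1, 11))
print(sliding_windows(sequence, window=4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
print(sliding_windows(sequence, window=4, stride=2))
# [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [7, 8, 9, 10]]
```

The same call would be applied to the input sequence and to the target sequence so that the windows stay aligned position by position.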
In this study, the Random Under Sampling (RUS) technique was also used to handle imbalanced data. The number of non-mutated nucleotides in the available data was much higher than the number of mutated ones. Class balance can affect the patterns learned by the Deep Neural Network (DNN), so RUS is applied to the training data that are later used to train the DNN. The sampling process begins by counting the sub-sequences that contain mutations and those that do not. RUS then randomly deletes sub-sequences that do not contain mutations (the majority group) until their number is balanced with the number of sub-sequences containing mutations. Fig. 3 shows the distribution of preprocessed data for the scheme without overlap on the EGFR gene, and Fig. 4 shows the distribution for the scheme with overlap on the same gene. Fig. 3 and Fig. 4 show that the overlapping reshape scheme produces more sub-sequences, so the sub-sequences learned by 1D-CNN, BiLSTM, and Bi-GRU are more varied.
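The sampling step described above can be sketched as follows. The function name, the fixed seed, and the decision to keep the surviving windows in their original order are assumptions made for this illustration.

```python
import random


def random_under_sample(inputs, targets, seed=0):
    """Randomly drop mutation-free windows until both groups have equal size.

    A window counts as 'mutated' if any of its labels is non-zero.
    """
    mutated = [i for i, t in enumerate(targets) if any(label != 0 for label in t)]
    normal = [i for i, t in enumerate(targets) if all(label == 0 for label in t)]
    rng = random.Random(seed)
    if len(normal) > len(mutated):
        normal = rng.sample(normal, len(mutated))
    keep = sorted(mutated + normal)  # preserve the original window order
    return [inputs[i] for i in keep], [targets[i] for i in keep]


targets = [[0, 0], [0, 1], [0, 0], [0, 0], [2, 0], [0, 0]]
inputs = [[1, 2], [3, 4], [1, 1], [2, 2], [4, 3], [3, 3]]
balanced_inputs, balanced_targets = random_under_sample(inputs, targets)
print(len(balanced_targets))  # 4 windows: 2 with mutations, 2 without
```

Balancing here is at the level of sub-sequences, as in the text; the nucleotide-level labels inside each kept window remain dominated by the normal class.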

B. JOIN CLASSIFIER OF TYPE AND INDEX MUTATION USING SEQUENTIAL LABELING MODEL
The model proposed in this study to detect the type and index of mutations simultaneously in DNA sequences is a join classifier using a sequential labeling model with 1D-CNN, BiLSTM, and Bi-GRU. In this sequential labeling model, each nucleotide in the sub-sequence is labeled "1" if there is a substitution mutation, "2" if there is an insertion or duplicate mutation, "3" if there is a deletion mutation, "4" if there is a delins mutation, or "0" if no mutation occurs (normal). The type and index detection model using a DNN requires a training and a testing process, so the available data are divided into training data and testing data. The training data are then further divided into training data and validation data. The training data are used to train the DNN model, and the validation data are used to calculate the accuracy of the system during training. Validation data also help to avoid overfitting, i.e., a model that performs very well on the training data but has low accuracy on test data. Test data are used to measure the accuracy of the model once training has been completed.
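To illustrate how a predicted label sequence yields both the type and the index of each mutation, the sketch below decodes labels into mutation calls. Merging adjacent positions with the same label into one run and the `offset` parameter for mapping a window back to the full sequence are assumptions of this example, not something the text specifies.

```python
from itertools import groupby

LABEL_NAMES = {1: "SNV", 2: "insertion", 3: "deletion", 4: "delins"}


def decode_labels(labels, offset=0):
    """Return (start_index, end_index, type) for each run of equal non-zero labels.

    Indices are 1-based; `offset` is the 0-based position at which this
    window starts in the full sequence.
    """
    calls = []
    position = offset
    for label, run in groupby(labels):
        length = len(list(run))
        if label != 0:
            calls.append((position + 1, position + length, LABEL_NAMES[label]))
        position += length
    return calls


print(decode_labels([0, 0, 0, 0, 0, 0, 0, 2, 2, 1]))
# [(8, 9, 'insertion'), (10, 10, 'SNV')]
```

Counting the returned calls also gives the number of mutations detected in one sequence.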
For each gene transcript, 90% of the sequence data was used for training and the remaining 10% for testing (Fig. 5). The training process aims to find the optimal hyperparameters. A 5-fold cross-validation was conducted to gauge the performance of the trained model. In each iteration, one part of the data is used as the validation data, while the remaining four parts are used as the training data. Accordingly, five iterations per experiment were conducted. Table 2 shows the number of patient sequences and the number of mutations resulting from preprocessing, as well as the distribution of normal and mutated nucleotides in the training and testing data. As presented in Table 2, the number of preprocessed sequence samples and their mutations is very limited, especially for the CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT genes; not all genes have insertion, deletion, and delins mutations; and the number of normal nucleotides is much higher than the number of mutated ones.
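The data division can be sketched with index lists. Shuffling before the split, the fixed seed, and the interleaved fold assignment are assumptions of this example; the text only states the 90/10 split per gene transcript and the 5-fold scheme.

```python
import random


def train_test_split(n_items, test_fraction=0.1, seed=0):
    """Shuffle item indices and hold out a fraction of them for testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each item is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


train_idx, test_idx = train_test_split(100)
splits = list(k_fold(train_idx, k=5))
print(len(train_idx), len(test_idx), len(splits))  # 90 10 5
```

In practice the split would be repeated for each gene transcript so that every transcript contributes to both the training and the testing data.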
The first method used is 1D-CNN. 1D-CNN is a variation of the Convolutional Neural Network (CNN) in which the kernel shifts in one dimension. 1D-CNN has been widely used on one-dimensional signals, including structural health monitoring, biomedical data classification and early diagnosis, and anomaly detection and identification in power electronics [37]. In this study, the proposed 1D-CNN has the following architecture:
- One embedding layer to change the integer sequence representation into a one-hot representation using Voss mapping (Equations 2-5).
- N layers of one-dimensional convolution, in which the output is calculated by performing dot product operations between all filters/kernels and the inputs at that layer (Equation 7). The numbers of convolution layers used are 2 and 4, with 128 kernels in the first layer and 256 kernels in the next layer, kernel size 3 in the first layer and 5 in the next layer, a stride of 1 in the convolution process, and the Rectified Linear Unit (ReLU) activation function (Equation 8) [38].
- A fully connected (dense) layer, which is an ordinary neural network layer whose function is to classify the output of the previous layer. This layer calculates the score for each class and has a one-dimensional output sized according to the number of classes. In this research, the dense layer is a Time Distributed layer because the model is a sequential labeling model. The Time Distributed layer produces as many outputs as there are inputs, which means that each nucleotide has one output, either normal or the type of mutation if a mutation occurs. This layer uses the SoftMax activation function given in Equation 9 [39].
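Equations 7-9 are not reproduced in this text. The standard definitions of a one-dimensional convolution, ReLU, and SoftMax are sketched below for reference; the index names (n for position, k for kernel offset, c for input channel, j for filter, C for the number of classes) are assumptions and may differ from the paper's notation.

```latex
\begin{align}
% 1D convolution of filter j (kernel size K, stride 1) over input x
y_j[n] &= b_j + \sum_{k=0}^{K-1} \sum_{c} w_j[k, c]\, x[n + k, c] \\
% Rectified Linear Unit
\mathrm{ReLU}(z) &= \max(0, z) \\
% SoftMax over the C class scores of one nucleotide position
\sigma(z)_i &= \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
\end{align}
```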
with σ = activation function, and z = input value.
The 1D-CNN architecture is trained using the backpropagation algorithm with the Adam (Adaptive Moment Estimation) optimizer, which uses an adaptive learning rate [40] to accelerate convergence. The initial learning rate is quite small, namely 0.0001.
The second method proposed is Bidirectional Long Short-Term Memory (BiLSTM), which is a type of Recurrent Neural Network (RNN). BiLSTM consists of two Long Short-Term Memory (LSTM) networks, in which one LSTM processes the input in the forward direction and the other processes it in the backward direction. The two LSTM outputs are combined and passed to the next layer [41]. One LSTM consists of input gates (i_t), forget gates (f_t), output gates (o_t), cell states (c_t), and cell outputs (h_t) [42]. The input gate processes the input vector and the previous cell's output vector, which will be stored in the cell state (Equation 10). The forget gate determines how much of the previous cell state is passed on to the calculation of the output cell (Equation 11). The output gate determines how much information in the cell state is passed to the output cell (Equation 12). The input gate, forget gate, and output gate are fully connected layers, the cell state is a memory cell (Equation 13), and the output cell is the output of the LSTM network (Equation 14).
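Equations 10-14 are not reproduced in this text. A commonly used formulation of the LSTM cell, ordered to match the description above, is sketched here; writing the weights as acting on the concatenation [h_{t-1}, x_t] is an assumption about notation, and the paper's exact form may differ.

```latex
\begin{align}
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```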
with x = input, W = weight, b = bias, t = timestep, and σ = sigmoid activation function.
The proposed BiLSTM model consists of three types of layers, namely one embedding layer, one or two BiLSTM layers, and one time-distributed layer. In the embedding layer, Voss mapping is used, as in the 1D-CNN model, to change the integer data representation into one-hot encoding. The numbers of LSTM units observed in the BiLSTM layer are 128 and 256. The other parameters used are the tanh activation function, the sigmoid function for recurrent activation, the softmax activation function in the time-distributed layer, dropout values of 0 and 0.2, a learning rate of 0.0001, and the Adam optimization algorithm.
The last method used is Bi-GRU. Like BiLSTM, the Bi-GRU architecture consists of two Gated Recurrent Units (GRU), in which one GRU processes the input in the forward direction and the other processes it in the backward direction. GRU is a variation of LSTM with a simpler architecture. GRU consists of an update gate, a reset gate, a candidate hidden state, and a hidden state. The update gate (z_t) determines how much information from the previous time step will be passed on to the next iteration, while the reset gate (r_t) determines how much information from the previous time step will be deleted. The result of the reset gate is used in the calculation of the candidate hidden state (h̃_t), and the results of the update gate and the candidate hidden state are used in the calculation of the hidden state (h_t) [9], [43].
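The GRU equations are not reproduced in this text. One standard formulation consistent with the description above, in which the update gate weights the previous hidden state, is sketched here. Bias terms are omitted, and some references and libraries swap the roles of z_t and (1 - z_t); both points are assumptions about the paper's notation.

```latex
\begin{align}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W x_t + U (r_t \odot h_{t-1})\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}
```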
where W and U are the weights to be learned, σ is the sigmoid activation function, and ⊙ is the Hadamard product.
The proposed Bi-GRU model also consists of three types of layers, namely one embedding layer, one or two Bi-GRU layers, and one time-distributed layer. In the embedding layer, Voss mapping is also used to change the integer data representation into one-hot encoding. The numbers of GRU units observed in the Bi-GRU layer are 128 and 256. The other parameters used are the tanh activation function, the sigmoid function for recurrent activation, the softmax activation function in the time-distributed layer, dropout values of 0 and 0.2, a learning rate of 0.0001, and the Adam optimization algorithm.

C. EXPERIMENTAL SCENARIO
The 1D-CNN, BiLSTM, and Bi-GRU models were initially trained using the EGFR gene training data, because the EGFR data contain the most complete set of mutations compared to the other genes, which helps to obtain the optimal weights for detecting the type and index of mutations, along with the optimal architecture and hyperparameters for each method. A 5-fold cross-validation technique was used to evaluate the model performance and to check for overfitting. The observed hyperparameters included the window size (length of sub-sequence) and stride for the sequence reshape process, data sampling using Random Under Sampling, the number of 1D-CNN layers, the number of LSTM or GRU units, the number of BiLSTM or Bi-GRU layers, and the dropout value. The detailed values of each observed hyperparameter are explained in the parameter observation section of each method. Each method is trained using the Adam optimization algorithm with a learning rate of 0.0001 for 100 epochs. The best hyperparameters and architecture for each model are selected based on the F1-score on the validation data. Finally, the performance results of the 1D-CNN, BiLSTM, and Bi-GRU models are compared, and the best models are trained and tested on type and index mutation detection using nine other genes, namely TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT.

III. RESULTS AND DISCUSSION
Observations were made to test the performance of mutation type and index detection using the 1D-CNN, BiLSTM, and Bi-GRU sequential labeling models on ten genes in the lung cancer dataset, based on the training and validation loss during training, the running time (training and testing time) in seconds, and the precision, recall, and F1-score [44], [45] on the test data. The types of mutations detected were SNV/substitution, insertion, deletion, and deletion-insertion (delins), while the mutation index is the nucleotide index in the DNA sequence being processed. The tests included observations of data preprocessing and observations of the 1D-CNN and BiLSTM hyperparameters and their effect on detection performance; Bi-GRU uses the optimal architecture and hyperparameters obtained for BiLSTM. Then, each of 1D-CNN, BiLSTM, and Bi-GRU with the best hyperparameters is retrained while increasing the number of epochs, the amount of training data and the number of genes, or the number of parameters in the neural network architecture. The proposed model is also compared with the well-known bioinformatics tool BLAST in detecting the type and index of mutations.
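A minimal sketch of position-wise precision, recall, and F1-score per mutation class is given below. It assumes that a prediction counts as correct only when the predicted label matches the true label at exactly the same nucleotide position; how the per-class values are averaged in the reported results is not stated here, so no averaging is shown.

```python
def per_class_scores(y_true, y_pred, classes=(1, 2, 3, 4)):
    """Position-wise (precision, recall, F1) for each mutation class."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f1)
    return scores


y_true = [0, 0, 1, 1, 2, 0, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 0]
scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(label, round(p, 4), round(r, 4), round(f1, 4))
```

In this toy example the SNV class (label 1) has one false positive, so its precision is 2/3 while its recall is 1, and the deletion class (label 3) has one missed position, so its recall is 0.5.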

A. 1D-CNN PARAMETER OBSERVATION AND PERFORMANCE
In this section, we observe the effect of 1D-CNN parameters and data preprocessing on the performance of type and index mutation detection using 1D-CNN. Observed parameters include:
• Dataset: EGFR gene.
Based on Table 3, training and validation loss did not differ significantly between the combinations of window size and stride. Overall, the sequence reshape scheme with overlap (using a stride) performs better than the scheme without overlap. This is because sequence reshape with overlap produces more data with more variety, so 1D-CNN can learn the data patterns better. Observations using the RUS sampling technique were carried out to determine its effect on the performance of type and index mutation detection using 1D-CNN. On the validation data, the scheme without RUS has a smaller training and validation loss but a lower average F1-score than the scheme with RUS. This indicates overfitting in the scheme without RUS, which can be caused by the number of normal nucleotides being much higher than the number of mutated nucleotides, so that the trained model is biased toward normal nucleotides. Therefore, when the F1-score is calculated for each type of mutation, the scheme without RUS has a smaller F1-score.
Furthermore, the effect of the number of 1D-CNN layers on the performance of type and index mutation detection was observed. The numbers of 1D-CNN layers observed were 2 and 4. The 1D-CNN model with 4 layers achieves a lower training and validation loss and a higher average F1-score than the 1D-CNN model with 2 layers. Meanwhile, the dropout value in the 1D-CNN model does not have a large influence on the training and validation loss or on the F1-score of the validation data.
The performance of 1D-CNN in detecting the type and index of mutations is very unstable and quite dependent on the data and hyperparameters used. It requires more varied data in the training process to learn the patterns of the mutations that occur. In Table 2, the number of mutations in the training data of the EGFR gene is quite large, namely 11,196 SNVs and 6,070 deletions, so the 1D-CNN model can detect them well. Figure 7 shows the comparison of training and validation loss for each combination of hyperparameters in detecting the type and index of mutations. The best average F1-score on the validation data was achieved with window size 50 and stride 25, a 4-layer 1D-CNN, and RUS, namely 0.9969 (SNV), 0.7847 (insertion), 0.9905 (deletion), and 1 (delins).

B. BILSTM PARAMETER OBSERVATION AND PERFORMANCE
In this section, we observe the effect of preprocessing data and BiLSTM parameters on the performance of detection of types and index mutation using BiLSTM. Based on Table 4, as with 1D-CNN, the reshape sequence scheme with overlap has a better validation F1-score than the reshape scheme without overlap and has a smaller training and validation loss. When observing the effect of data sampling with RUS, the training loss and validation loss in the scheme without RUS are better than in the scheme with RUS. However, RUS can increase the F1-score of mutation detection. This shows that BiLSTM can learn data patterns when the data are balanced across classes, even though the total amount of data is smaller.
The next observation was to determine the effect of the number of LSTM units on the performance of the type and index mutation detection model using BiLSTM. The numbers of LSTM units used are 128 and 256. The architecture with 256 LSTM units displayed better training and validation loss, reached the convergence point faster, and predicted all types of mutation better than the architecture with 128 LSTM units. This is because models with more LSTM units can learn complex data better. Furthermore, using a dropout value of 0.2 can improve detection performance on the BiLSTM model.
Furthermore, increasing the number of BiLSTM layers in the models with window size 100; stride 25 and window size 150; stride 50, with 256 LSTM units, RUS, and dropout, improves the system performance. The BiLSTM model with two layers has a better validation performance than the one-layer model. With the same number of LSTM units, a larger number of layers can learn more complex data, thereby increasing performance. However, a larger number of layers and LSTM units can cause overfitting, so the use of dropout is necessary. The best performance of type and index mutation detection using BiLSTM was achieved when the number of

C. BI-GRU PARAMETER OBSERVATION AND PERFORMANCE
In this section, we observe the effect of preprocessing data and parameters on the performance of detection of types and index mutation using Bi-GRU. Observed parameters include:
- Dropout value: 0 (without dropout) and 0.2.
Similar to 1D-CNN and BiLSTM, the model built using Bi-GRU also performs better when using the reshape sequence scheme with overlap. RUS can also improve validation performance compared to the scheme without RUS. This indicates that the detection model requires a large amount of data in which the amount of data in each nucleotide class is more balanced. Using a larger number of layers and GRU units can also improve detection performance. As shown in Figure 9, the model with two layers of Bi-GRU has a higher level of convergence than the other models.
The Bi-GRU models with window size 150; stride 50 and with window size 100; stride 25 achieve the same high average validation F1-score (0.9957), where both models use RUS, two layers of Bi-GRU, 256 GRU units, and dropout 0.2. In this study, the best model was selected based on the F1-score of the validation data, namely the model with window size 150; stride 50, because this model is better at detecting SNVs and insertions and only has a smaller F1-score for detecting deletions when compared to the model with window size 100; stride 25. The model with window size 150; stride 50 also has a smaller training and validation loss. The best model using Bi-GRU achieves validation F1-scores of 0.9980 (SNV), 0.9961 (insertion), 0.9887 (deletion), and 1 (delins).

D. PERFORMANCE COMPARISON OF THE PROPOSED MODEL ON EACH GENE
In this section, the best models, with their architectures and hyperparameters, are tested using the EGFR gene test data. The proposed model is also compared with BLAST pairwise alignment on the EGFR gene to check the strength of the proposed model. BLAST is one of the well-known bioinformatics tools and is often used for sequence prediction, sequence alignment, and other tasks [46], [47]. To detect mutations in DNA sequences using BLAST, the tested sequences are first aligned to the reference sequence, and the mutation type and index are then obtained by manually inferring them from the alignment results.
Based on the tests carried out on the EGFR gene using the proposed model (BiLSTM, Bi-GRU, and 1D-CNN) and the alignment technique using BLAST, BiLSTM and Bi-GRU achieve high performance in type and index mutation detection, namely 0.9271 (precision), 0.9953 (recall), and 0.9553 (F1-score) for BiLSTM and 0.9264 (precision), 0.9975 (recall), and 0.9561 (F1-score) for Bi-GRU. Meanwhile, the performance of 1D-CNN is 0.9989 (precision), 0.8857 (recall), and 0.9319 (F1-score), while the BLAST performance is 0.8773 (precision), 0.8741 (recall), and 0.8757 (F1-score) (Fig. 10). BLAST alignment is very accurate in detecting substitution mutations, but it is prone to errors when detecting insertions and deletions because of the nucleotide shift. The mutation index detected using BLAST is given a tolerance of 5 bp from the actual mutation index to deal with the sequence shifts that occur with insertion and deletion mutations, while the proposed model predicts the mutation index using exact matching, with no tolerance for the predicted results. Based on the test results obtained, the proposed model is therefore superior in detecting the index of insertion and deletion mutations, and its F1-score for detecting SNV mutations differs from BLAST's F1-score by only 0.0042.
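The difference between exact index matching and matching with a tolerance window can be illustrated with a small sketch. The greedy one-to-one matching below and the function name are assumptions made for illustration; the text does not describe how BLAST-derived indexes were paired with the true indexes.

```python
def index_hits(true_indices, predicted_indices, tolerance=0):
    """Count predicted indices within `tolerance` of a not-yet-matched true index."""
    remaining = sorted(true_indices)
    hits = 0
    for p in sorted(predicted_indices):
        match = next((t for t in remaining if abs(t - p) <= tolerance), None)
        if match is not None:
            remaining.remove(match)  # each true index can be matched only once
            hits += 1
    return hits


true_idx = [100, 250]
pred_idx = [103, 250, 400]
print(index_hits(true_idx, pred_idx, tolerance=0))  # 1 (exact match only)
print(index_hits(true_idx, pred_idx, tolerance=5))  # 2 (103 counts as a hit for 100)
```

With a 5 bp tolerance a shifted call still counts as correct, so a score obtained under exact matching is the stricter of the two.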
In addition to the BLAST tool, several researchers have also studied the detection of mutation indexes. Zuo et al. detected mutation position indexes using a Feedback Fast Learning Neural Network [48] on different data, but the system performance was calculated based on the number of mutations detected by the model. Chen and Xie used the PCR matching method to detect mutations in exon data. PCR matching can achieve an accuracy of 97.26% with a detection time of 96 seconds [49]. In comparison to these previous studies, the proposed model is quite promising for application to other DNA sequence data, considering its performance. The proposed model can detect insertion and deletion mutation indexes better than BLAST pairwise alignment because it can learn mutation data patterns from the training data provided. Also, the proposed model can detect several types of mutations and their indexes in one DNA sequence because it uses a sequential labeling model, where each nucleotide is labeled either as normal or as a mutation of a specified type. Furthermore, the proposed model uses data with mutation labels, so its performance can be measured by calculating the precision, recall, and F1-score of the predicted mutation type and index.
Furthermore, the proposed sequential labeling model performs better with BiLSTM and Bi-GRU than with 1D-CNN. 1D-CNN has higher precision, but its recall is far below that of BiLSTM and Bi-GRU, so the resulting F1-score is lower. As shown in Fig. 10, the recall of 1D-CNN is also unstable, dropping to 0.6354 on the detection of insertion mutations. Therefore, the next test uses the sequential labeling model with BiLSTM and Bi-GRU on the genes TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT to find out how robust the proposed model is. Table 6 presents the testing performance comparison of the best models of the proposed method, namely sequential labeling with BiLSTM and Bi-GRU, on ten genes in lung cancer. Based on the table, the proposed method is very good at detecting the type and index mutations in each gene even though the type and number of mutations in each gene are different. Bi-GRU achieved an average precision of 0.9728, recall of 0.9508, and F1-score of 0.9596, with an average detection time of 0.0164 seconds for one sequence. BiLSTM achieved a slightly higher average precision of 0.9771 and F1-score of 0.9612, with a recall of 0.9497 and an average detection time of 0.0243 seconds for one sequence. This shows that the proposed method is robust in detecting the type and index mutation even though the genes differ and each gene has a different number of samples and mutations. BiLSTM is superior in detecting the type and index of mutations in the ERBB2 and PTPRT genes, while Bi-GRU is superior in the EGFR, TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, and BRAF genes. Fig. 11 shows the confusion matrix of the type and index mutation detection in each gene using BiLSTM. The confusion matrix shows the number of mutations in each mutation type and each gene.
It also shows how many mutations are detected correctly and how many are still misdetected. In the EGFR gene, detection errors occurred in normal nucleotides that were detected as insertion and deletion mutations. In the TP53 gene, errors occurred in the detection of insertions and deletions, and in the PTPRD and PTPRT genes, errors occurred in SNV detection. For the other genes, the detection error is very small, namely below ten nucleotides for each mutation type and gene.
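A confusion matrix of this kind can be tallied directly from the true and predicted per-nucleotide labels. The sketch below is a minimal illustration; the label names and the toy sequences are hypothetical and do not reproduce the counts in Fig. 11.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true label, predicted label) pairs, one pair per nucleotide."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def misdetections(counts):
    """Return only the off-diagonal cells, i.e. nucleotides labeled incorrectly."""
    return {pair: n for pair, n in counts.items() if pair[0] != pair[1]}


# Toy example: one normal nucleotide is wrongly labeled as an insertion.
true_labels = ["N", "N", "N", "INS", "INS", "DEL"]
predicted_labels = ["N", "INS", "N", "INS", "INS", "DEL"]
counts = confusion_counts(true_labels, predicted_labels)
errors = misdetections(counts)
```

The diagonal cells of `counts` give the correctly detected nucleotides per type, while `errors` lists cases such as a normal nucleotide detected as an insertion.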
As shown in Table 6, the performance of BiLSTM and Bi-GRU is not much different even though the average F1-score of BiLSTM is higher than that of Bi-GRU. Therefore, a t-test was conducted on the EGFR dataset to test how significant the difference in performance between BiLSTM and Bi-GRU was. The t-test was carried out using the 5 × 2 cross validation method, in which the EGFR dataset was split into two equal parts, a training set and a testing set, in each of five iterations, and was performed on the best BiLSTM and Bi-GRU models. The 5 × 2 CV process yielded 10 testing F1-scores for each model, which were used to calculate the mean and standard deviation of the detection performance of the BiLSTM and Bi-GRU models, and a t-test was then conducted to test whether the difference in performance between the two models was significant. In Fig. 12, BiLSTM and Bi-GRU have similar average F1-scores and very small standard deviations of 0.0107 for BiLSTM and 0.0083 for Bi-GRU. This shows that BiLSTM and Bi-GRU have stable performance even when the portion of the EGFR dataset used varies. Furthermore, the p-value from the t-test was 0.5046 (>0.05), indicating that the performance of BiLSTM and Bi-GRU was not significantly different. Fig. 13 and Fig. 14 show examples of type and index mutation detection outputs using the proposed sequential labeling model with BiLSTM. In Fig. 13, the mutation type detected in the test sequence is an SNV mutation with an index of 2477, and the confusion matrix shows that the detection was carried out correctly. In Fig. 14, the mutation type detected is an insertion mutation with an index of 2187-2192; in the confusion matrix, all insertion mutation indexes were detected correctly, but one normal nucleotide was detected as a deletion mutation.
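The 5 × 2 CV comparison can be sketched as follows, assuming the commonly used 5 × 2 cv paired t-test formulation, in which the statistic is compared against a t distribution with 5 degrees of freedom. The exact form of the test used in this study is not specified here, and the F1-scores in the example are invented for illustration only.

```python
import math

# Two-sided critical value of the t distribution at alpha = 0.05 with 5 degrees of freedom.
T_CRITICAL_5DF = 2.571


def paired_t_5x2cv(scores_a, scores_b):
    """5 x 2 cv paired t statistic for two models.

    scores_a, scores_b: five (fold_1, fold_2) pairs of F1-scores,
    one pair per replication of the 2-fold split.
    """
    diffs = [(a1 - b1, a2 - b2) for (a1, a2), (b1, b2) in zip(scores_a, scores_b)]
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    pooled = sum(variances) / len(variances)
    if pooled == 0:
        raise ValueError("score differences have zero variance")
    # The numerator is the difference from the first fold of the first replication.
    return diffs[0][0] / math.sqrt(pooled)


# Invented F1-scores for two models over five replications of 2-fold CV.
f1_model_a = [(0.95, 0.96), (0.94, 0.96), (0.96, 0.95), (0.95, 0.95), (0.96, 0.94)]
f1_model_b = [(0.96, 0.95), (0.95, 0.95), (0.95, 0.96), (0.96, 0.94), (0.95, 0.95)]
t_stat = paired_t_5x2cv(f1_model_a, f1_model_b)
significant = abs(t_stat) > T_CRITICAL_5DF
```

With these invented scores the statistic stays well inside the critical value, so the difference would not be judged significant at the 0.05 level, the same kind of conclusion drawn above for BiLSTM and Bi-GRU.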
For future research, we plan to extend the proposed sequential labeling model to detect type and index mutations in other cancer types, other diseases, and diseases caused by viruses. Other deep learning models, oversampling techniques, and data augmentation will also be explored. In particular, further study is needed on oversampling and data augmentation methods for DNA sequence mutation data.

IV. CONCLUSION
In this work, a sequential labeling model using 1D-CNN, BiLSTM, and Bi-GRU was applied to DNA sequences from lung cancer cases to detect the type and index mutations simultaneously. The data used are DNA sequences of the EGFR, TP53, KRAS, CTNNB1, SMARCA4, CDKN2A, PTPRD, BRAF, ERBB2, and PTPRT genes, which are known to display many mutations in lung cancer cases and were obtained from COSMIC. Based on the findings, the proposed sequential labeling model performs better and is more stable with BiLSTM and Bi-GRU than with 1D-CNN. BiLSTM and Bi-GRU also achieved high performance across different genes, showing that the proposed method is robust in detecting the type and index mutation. Furthermore, the proposed model performed better than BLAST in detecting insertion and deletion mutations, and its SNV mutation detection performance differs only slightly from that of BLAST. The model detects mutations directly using a previously trained model, without re-aligning the sequence to the reference sequence. It requires only a test DNA sequence and needs no other data or supporting tools to detect the type and index mutations. Based on the results obtained, the proposed model is promising for detecting the type and index mutations in DNA sequences for other cancers and other diseases.