A Deep Bidirectional LSTM-GRU Network Model for Automated Ciphertext Classification

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are a class of Recurrent Neural Networks (RNN) suitable for sequential data processing. The Bidirectional LSTM (BLSTM) enables a better understanding of context by learning from future time steps in a bidirectional manner. Moreover, the GRU deploys reset and update gates in the hidden layer, which makes it computationally more efficient than a conventional LSTM. This paper proposes an efficient network model based on a hybrid BLSTM-GRU for ciphertext classification, aiming to identify the category to which a ciphertext belongs. The performance of the proposed model was evaluated using well-known evaluation metrics on two publicly available datasets encrypted with various classical cipher methods, and it was compared against a one-dimensional convolutional neural network (1D-CNN) and various other deep learning-based approaches. The experimental results showed that the BLSTM-GRU cell unit network model achieved a high classification accuracy of up to 95.8%. To the best of our knowledge, this is the first time an RNN-based model has been applied to ciphertext classification.


I. INTRODUCTION
With the increasing rate of data transfer over the internet, system security is becoming one of the most important issues in information exchange [1], [2]. System security can be subdivided into cryptography and cryptanalysis. The purpose of a cryptosystem is to provide security by decorrelating the plaintexts and ciphertexts, making the plaintexts unreadable [3]-[5]. Cryptanalysis investigates the weaknesses of a cryptosystem to ensure system security. In cryptanalysis, the attacker tries to recover the original form of a secured message by analyzing hidden patterns in the data or by finding the secret key. One of the main ways to attack a cryptosystem is to analyze hidden data patterns to reveal the main information of the ciphertext [1], [6], [7]. This amounts to building a structured knowledge representation by extracting features from the ciphertext, normally using machine learning algorithms to process the information confined in the ciphertext; the techniques involved include machine learning, statistics, computational linguistics, and information retrieval [8]. Another way is to find the secret key to recover the original message. Ciphertext classification, which aims to mark the category to which a ciphertext belongs, can help an attacker reveal the subject of the information being exchanged. Ciphertext classification is a supervised learning task in which machine learning algorithms are trained on a set of labeled data from different classes using features extracted from the documents. Different types of information with distinctive features can be involved in such classification tasks.
An artificial neural network (ANN) is a commonly used method for many recognition and classification tasks, as well as cryptanalysis [9]-[11]. Many ANN-based and evolutionary approaches have been used in the literature for cryptanalysis [12]-[17]. Convolutional neural networks (CNNs) are a powerful machine learning approach introduced several years ago. Recent advances in CNNs have demonstrated remarkable performance in different data processing tasks, applied mainly to medical image analysis [18], [19]. By training on a set of annotated data, CNNs can extract hidden patterns in data with remarkable accuracy. On the other hand, ANN- and CNN-based methods are limited by their lack of sequential data processing ability, considering the unique characteristics of ciphertext [20]-[22]. The long short-term memory (LSTM), recurrent neural network (RNN), and gated recurrent unit (GRU) cell units have been established for sequential data processing, taking temporal features into account using memory cell units [20], [23]-[25]. Because the ciphertext sequence is decorrelated from the original plaintext, finding discriminative features between different categories requires a deep understanding of the context [26]. The bidirectional LSTM (BLSTM) provides a better understanding of context by learning in a bidirectional manner and learning representations from future time steps [27], [28]. On the other hand, the GRU cell unit is computationally more efficient than a conventional LSTM by deploying update and reset gates in its hidden layers [29]. In our previous study, an attention-based LSTM was proposed to attack classical ciphers [30]. The attention mechanism improves the ability of the LSTM to retain important information along the sequence.
Realizing the lack of deep learning applications for ciphertext classification, this article presents a novel hybrid network model using BLSTM and GRU cell units to classify ciphertext. The proposed network effectively captures the temporal dependencies of the features and the text features, which are essential for efficient ciphertext classification. The present research evaluated three classical cipher methods (Caesar cipher, Vigenere cipher, and substitution cipher). The efficiency of the proposed model was assessed using well-known evaluation metrics (accuracy, recall, precision, and F1 score) on two datasets: the publicly available Brown corpus and a company report dataset. The experimental results showed that the proposed model could effectively and quickly determine the ciphertext category and achieve a high classification accuracy of 95.8%. In addition, we proposed a 1D CNN network model to evaluate the efficiency of the proposed network against a CNN model.
The main contributions of this work with respect to the security and automated ciphertext classification challenges are as follows:
(1) A hybrid network model was designed based on BLSTM-GRU cell units for automated ciphertext classification, supporting different ciphertext lengths.
(2) In addition to features in the ciphertext document, this study focused on the temporal dependency of the input sequence using RNN-based cell units.
(3) The efficiency of the proposed method compared to various deep learning-based models was explored in this paper. The results showed that the proposed network model outperformed other deep learning-based approaches and was more efficient for automated ciphertext classification.
The remainder of this article is organized as follows. Section II briefly reviews the RNN, LSTM, GRU, and the classical cipher methods used in this experiment. Details of the proposed hybrid BLSTM-GRU and 1D CNN network models are presented in Section III. Section IV presents details of the experiments, followed by results and discussion. Concluding remarks are given in the last section.

II. METHODOLOGY
This section presents details of the RNN, LSTM, and GRU cell units used as the main building blocks of the network model, followed by a review of the classical cipher techniques. The review is restricted to the above-mentioned state-of-the-art techniques because the primary focus is on ciphertext classification.

A. RECURRENT NEURAL NETWORKS (RNNs)
The RNN is a widely used neural network capable of processing sequential data, which makes it suitable for learning algorithmic tasks. The RNN has been used in many natural language processing (NLP) applications. The main limitation of the RNN is that it suffers from vanishing gradients in deep networks (see Fig. 1(a)) [31], [32]. For sequence data $(x_1, x_2, x_3, \ldots, x_t)$, the hidden state $h_t$ of the RNN is calculated using the following equation:

$$h_t = f(W_h h_{t-1} + W_x x_t + b_h)$$

where $f$ denotes the activation function, $W_h$ and $W_x$ are the layer weights, and $b_h$ is the bias.
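To make this recurrence concrete, a minimal NumPy sketch (with illustrative dimensions and random weights) unrolls the update over a short sequence:

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One recurrent step: h_t = tanh(W_h @ h_prev + W_x @ x_t + b).
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

d_in, d_h = 4, 8                        # illustrative input and hidden sizes
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_h, d_in))      # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))       # hidden-to-hidden weights
b = np.zeros(d_h)

h = np.zeros(d_h)                       # initial hidden state
for x_t in rng.normal(size=(5, d_in)):  # unroll over a length-5 sequence
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h.shape)                          # (8,)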

B. LONG SHORT-TERM MEMORY (LSTM)
LSTM is a specific type of RNN designed to solve the vanishing gradient problem of the RNN. An LSTM is made up of three main gates that control the flow of information into and out of the LSTM cell to protect and control information: the forget gate, the input gate, and the output gate (see Fig. 1(b)) [31], [33]. The input gate controls which new information is added to the cell state, the forget gate decides which information is memorized or eliminated from the cell, and the output gate produces the LSTM output. Sigmoid and hyperbolic tangent functions are mainly used in LSTM cells:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid and $\tanh$ is the hyperbolic tangent activation function, respectively; $i$, $f$, $o$, $c$, and $h$ are the input gate, forget gate, output gate, intermediate (candidate) cell state, and the cell memory output; $\odot$ denotes element-wise multiplication; $t$ represents the time step and $T$ represents the length of the window (the length of a sliding cutout of a time sequence of data); $W$ and $U$ denote the layer weights applied to the input $x$ and hidden state $h$, and $b$ represents the threshold (bias) of each gate.

C. GATED RECURRENT UNIT (GRU)
The GRU is a type of RNN structure with fewer gates than the LSTM. In the GRU cell unit, the roles of the input and forget gates are controlled by a single update gate; hence, the forget gate and input gate are combined into one gate, making the GRU simpler than the LSTM [29]. For example, if $z_t = 1$, the entry of new data through the input gate is closed and the forget gate is opened, whereas the mechanism acts vice versa when $z_t = 0$ (see Fig. 1(c)). The reset gate determines how to combine the new input with the previous memory to calculate the new state. The GRU differs from the LSTM as follows:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

where $r_t$ stands for the reset gate and $z_t$ represents the update gate; $\odot$ denotes element-wise multiplication; $t$ represents the time step; $W$ and $U$ denote the layer weights applied to the input $x$ and hidden state $h$, and $b$ represents the threshold (bias) of each gate.

D. CLASSICAL CIPHERS
The classical ciphers used in this experiment, namely the Caesar cipher, Vigenere cipher, and substitution cipher, are explained briefly below. To encrypt the original plaintext into unreadable ciphertext with a shift (or Caesar) cipher, each letter in the original message is replaced with the letter a fixed number of positions up or down the alphabet. The number of possible shifts is limited to between 0 and 25 for the English language, i.e., the number of English letters. The receiver decodes the ciphertext message by shifting each letter in the encrypted message back [34]. The Vigenere cipher is a poly-alphabetic cipher that encrypts a plaintext letter into a set of different letters using a key of length m, giving $26^m$ possible keys in total. The substitution cipher deploys any permutation of the 26 letters as a key; therefore, the total number of possible keys is $26! \approx 2^{88.4}$. Table 1 gives an example of the classical cipher methods.
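For illustration, the three ciphers can be implemented in a few lines of Python; this is a minimal sketch, and the keys used here are illustrative rather than taken from the experiments:

import random
import string

ALPHABET = string.ascii_lowercase

def caesar_encrypt(plaintext, shift):
    # Replace each letter with the one `shift` positions down the alphabet (0-25).
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in plaintext.lower()
    )

def vigenere_encrypt(plaintext, key):
    # Shift the i-th letter by the i-th key letter, cycling through the key.
    out, k = [], 0
    for ch in plaintext.lower():
        if ch in ALPHABET:
            shift = ALPHABET.index(key[k % len(key)])
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
            k += 1
        else:
            out.append(ch)
    return "".join(out)

def substitution_encrypt(plaintext, key):
    # Apply a fixed permutation of the 26 letters (one of 26! ~ 2^88.4 keys).
    return plaintext.lower().translate(str.maketrans(ALPHABET, key))

perm_key = "".join(random.sample(ALPHABET, 26))
print(caesar_encrypt("attack at dawn", 3))          # dwwdfn dw gdzq
print(vigenere_encrypt("attack at dawn", "lemon"))
print(substitution_encrypt("attack at dawn", perm_key))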

E. WORD EMBEDDING
Word embedding is a set of language feature learning techniques in NLP that converts word tokens into machine-readable vectors. Word2vec is a two-layer neural network that converts text words into vectors: the input is a text corpus, and the output is a set of vectors. The advantage of word2vec is that it can be trained on large-scale corpora to produce low-dimensional word vectors [35]. Given a sentence consisting of $n$ words $(x_1, x_2, x_3, \ldots, x_{n-2}, x_{n-1}, x_n)$, every word $x_i$ is converted into a real-valued vector $e_i$, represented as

$$e_i = W^{e} v_i, \quad W^{e} \in \mathbb{R}^{d \times |V|}$$

where $v_i$ is the one-hot vector of the word $x_i$ over the vocabulary $V$, and $d$ is the size of the word embedding.
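As a minimal sketch of the embedding lookup, assuming a learned embedding layer (as used in the network of Section III) rather than pretrained word2vec vectors; the toy vocabulary and dimension here are illustrative:

import torch
import torch.nn as nn

# Toy vocabulary built from ciphertext tokens (hypothetical example).
vocab = {"<pad>": 0, "dwwdfn": 1, "dw": 2, "gdzq": 3}
d = 8  # word embedding dimension (a tunable hyperparameter, see Section IV)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d, padding_idx=0)

token_ids = torch.tensor([[vocab["dwwdfn"], vocab["dw"], vocab["gdzq"]]])
vectors = embedding(token_ids)  # real-valued vectors e_i, shape (1, 3, d)
print(vectors.shape)            # torch.Size([1, 3, 8])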

III. NETWORK ARCHITECTURE DESIGN
This section provides details of the proposed hybrid BLSTM-GRU network model and 1D CNN-based network model.

A. PROPOSED HYBRID BLSTM-GRU NETWORK MODEL
First, the proposed network was tested using an LSTM network with three layers, and the results were evaluated. The parameter setting for each LSTM layer was selected experimentally. Subsequently, the LSTM layers were replaced with BLSTM and GRU cell units, and the network performance was evaluated. Table 2 lists the optimal hyperparameter settings of the proposed network model, and Fig. 2 shows its overall structure. The input layer of the proposed hybrid network model is a sequence input layer that feeds the sequential ciphertext data into the network, followed by a word-embedding layer. Next is a BLSTM layer, followed by a dropout layer. The BLSTM layer learns the dependencies and dynamics between sequence data in a bidirectional manner, which is important for learning discriminative features of the data at each time step. The dropout layer randomly drops a certain number of neurons to improve the generalizability of the network and prevent overfitting. The next layer of the proposed model is a GRU cell unit with 200 hidden units, which can extract contextual features at a lower computational cost than an LSTM; it is again followed by a dropout layer. Afterward, a conventional LSTM unit with 200 hidden units is used, followed by a fully connected layer with 60 neurons. The last layer is a fully connected (FC) layer with the number of neurons equal to the number of classes in each dataset. A softmax function is used to generate the probability of each ciphertext class:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \quad i = 1, \ldots, k$$

where $z$ stands for the input vector and $k$ is the number of classes. The proposed method can fully characterize the information in each ciphertext based on the high-precision sequence labeling ability of the network model.
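A minimal PyTorch sketch of this architecture is given below. The paper's implementation uses the Matlab deep learning library; here, the vocabulary size, embedding dimension, BLSTM width, and the use of the final LSTM hidden state before the fully connected layers are assumptions for illustration:

import torch
import torch.nn as nn

class BLSTMGRUClassifier(nn.Module):
    # Sketch: embedding -> BLSTM -> dropout -> GRU(200) -> dropout ->
    # LSTM(200) -> FC(60) -> FC(num_classes); softmax is applied in the loss.
    def __init__(self, vocab_size, embed_dim=100, num_classes=4, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.blstm = nn.LSTM(embed_dim, 100, batch_first=True, bidirectional=True)
        self.drop1 = nn.Dropout(dropout)
        self.gru = nn.GRU(200, 200, batch_first=True)    # 200 hidden units, as in the text
        self.drop2 = nn.Dropout(dropout)
        self.lstm = nn.LSTM(200, 200, batch_first=True)  # conventional LSTM, 200 units
        self.fc1 = nn.Linear(200, 60)
        self.fc2 = nn.Linear(60, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq, embed_dim)
        x, _ = self.blstm(x)             # (batch, seq, 200): 100 units per direction
        x = self.drop1(x)
        x, _ = self.gru(x)
        x = self.drop2(x)
        _, (h, _) = self.lstm(x)         # final hidden state summarizes the sequence
        x = torch.relu(self.fc1(h[-1]))  # (batch, 60)
        return self.fc2(x)               # class logits

model = BLSTMGRUClassifier(vocab_size=5000, num_classes=4)
logits = model(torch.randint(0, 5000, (2, 50)))  # two sequences of length 50
print(logits.shape)                              # torch.Size([2, 4])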
The proposed model-training algorithm can be explained in the following steps.

Algorithm: Proposed BLSTM-GRU network model

B. PROPOSED 1D CNN NETWORK MODEL
A CNN is typically built from the following elements: 1) a set of convolutional filters, 2) an activation function, and 3) a max-pooling layer. A convolution layer is a fundamental component of the CNN architecture that performs feature extraction through a combination of linear and nonlinear operations:

$$y_i = f\!\left(\sum_{j=1}^{N} w_j x_{i+j-1} + b\right)$$

where $i$, $w$, $b$, $x$, and $N$ are the output index, layer weights, bias, input data, and the filter length, respectively. Maximum pooling is a pooling operation that calculates the maximum value within each region of a convolution feature map; the results are downsampled (pooled) feature maps that highlight the most present feature in each patch:

$$p_i = \max_{j \in R_i} c_j$$

where $c$ stands for the convolution layer values after the convolution operation and $R_i$ is the $i$-th pooling region. The ReLU function is a nonlinear function applied to increase the nonlinearity of the CNN feature maps:

$$f(x) = \max(0, x)$$

Batch normalization is a method used to make artificial neural networks faster and more stable by normalizing each input variable with its mean and standard deviation. The proposed 1D CNN network consists of three parallel pathways that extract features from the ciphertext. Each pathway consists of two 1D convolution layers, each followed by batch normalization, ReLU, and dropout layers (see Fig. 3). The extracted features are then fed to a max-pooling layer to reduce the data dimension. The numbers of convolutional filters in the two layers of each pathway were 64 and 128, respectively. Subsequently, the extracted features from the pathways are concatenated using a depth concatenation operation and fed into a fully connected layer. The softmax function is used to generate the probability of each class. Table 3 lists the proposed 1D CNN parameters.
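A corresponding PyTorch sketch of the three-pathway 1D CNN follows; the filter counts (64 and 128) follow the text, while the embedding front end and the per-pathway kernel sizes are assumptions:

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # One 1D convolution followed by batch normalization, ReLU, and dropout.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.Dropout(0.2),
    )

class CNN1DClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.paths = nn.ModuleList([
            nn.Sequential(conv_block(embed_dim, 64, k), conv_block(64, 128, k))
            for k in (3, 5, 7)               # three parallel pathways
        ])
        self.pool = nn.AdaptiveMaxPool1d(1)  # max-pool to one value per filter
        self.fc = nn.Linear(3 * 128, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq)
        feats = [self.pool(p(x)).squeeze(-1) for p in self.paths]
        return self.fc(torch.cat(feats, dim=1))    # depth concatenation -> logits

model = CNN1DClassifier(vocab_size=5000, num_classes=15)
print(model(torch.randint(0, 5000, (2, 50))).shape)  # torch.Size([2, 15])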

IV. RESULTS AND DISCUSSION
This section first introduces the datasets used to evaluate the efficiency of the proposed network model for ciphertext classification. It then presents the experimental setup for network training, the training of the networks on the different datasets, the network performance analysis, and the impact of hyperparameter tuning, followed by a discussion of the results.

A. DATASETS
Experiments were conducted on two datasets to validate the effectiveness of the proposed network model. The datasets were divided randomly into training and test sets, and the optimal split between training and test data was determined experimentally.

Dataset-1:
The first dataset is the company report dataset, containing documents related to different issues occurring during company operation. It consists of four hundred and eighty documents from four different classes, where each class represents a group of reports related to failures in different company sections.

Dataset-2:
The second dataset is the Brown corpus, which consists of a collection of text samples from fifteen different classes. Both datasets were encrypted using the three classical cipher methods (Caesar, substitution, and Vigenere ciphers). Table 4 lists the total number of samples in each dataset and the data distribution between the training and test sets. Fig. 4 shows the visualization of the class distribution of both datasets.

B. EXPERIMENTAL SETUP
Table 5 lists the system configuration used for training the proposed network models.

C. MODEL TRAINING
In this phase, the main task is training the proposed network models over the encrypted domain. The training procedure can be sped up using a Graphics Processing Unit (GPU). The proposed models were implemented using the Matlab deep learning library, which can be executed on a GPU; this accelerates the training process 5 to 10 times. A stochastic gradient descent (SGD) training strategy subdivides the training dataset into so-called mini-batches for each training epoch. A mini-batch size of 128 was used for training the proposed method, which yielded better performance; all optimal parameters were obtained experimentally. A dropout of 0.2 was used to prevent network overfitting. The Adam optimizer was used with a 0.001 learning rate and a cross-entropy loss function [37]. The cross-entropy measures the difference between the actual and predicted output of the model:

$$L = -\frac{1}{N} \sum_{i=1}^{N} y_i \log \hat{y}_i$$

where $\hat{y}_i$ stands for the probability predicted by the network, $y_i$ is the ground truth, $i$ indexes the data samples, and $N$ is the total number of samples. The training progress plot shows the training accuracy per mini-batch. The training plots and corresponding cross-entropy loss for each mini-batch of both encrypted datasets against plaintext using the BLSTM-GRU network model are shown in Fig. 5(a) for the Brown dataset and Fig. 5(b) for the company report dataset. The classifier accuracy using the proposed BLSTM-GRU network model oscillates between 92% and 100% for the Brown dataset and between 95% and 100% for the company report dataset. Similar results were found for plaintext classification accuracy. Fig. 6 presents the training process of the proposed 1D CNN model on both datasets; its classifier accuracy oscillates between 72% and 81% for the Brown dataset and between 77% and 85% for the company report dataset.
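The training configuration can be sketched as follows; this reuses the BLSTMGRUClassifier sketch from Section III, and the data tensors are placeholders standing in for the tokenized, encrypted documents:

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randint(0, 5000, (480, 50))  # placeholder token sequences
y = torch.randint(0, 4, (480,))        # placeholder class labels
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)  # mini-batch 128

model = BLSTMGRUClassifier(vocab_size=5000, num_classes=4)  # sketched in Section III
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, learning rate 0.001
criterion = torch.nn.CrossEntropyLoss()                     # cross-entropy loss

for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # actual vs. predicted divergence
        loss.backward()
        optimizer.step()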

D. PERFORMANCE ANALYSIS
The performance of the classification model was evaluated using a confusion matrix, which is a widely used method for measuring the classification accuracy of machine learning methods. The confusion matrix was calculated as listed in Table 6, where FN represents a false negative, i.e., the actual class is 1 and the predicted class is 0 (TP, TN, and FP denote true positives, true negatives, and false positives, defined analogously). Fig. 7 presents the confusion matrix of the proposed BLSTM-GRU model for classifying Vigenere-encrypted text versus plaintext. The confusion matrices for Caesar and substitution cipher encrypted text show similar results to Fig. 7. The confusion matrix for simple plaintext without encryption shows higher accuracy than for the encrypted text. Afterward, we evaluated the classification performance of the proposed BLSTM-GRU model using well-known metrics, namely the accuracy, precision, recall, and F1-measure, given by equations (19), (20), (21), and (22). The accuracy is the fraction of correct predictions among all predictions:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (19)$$

The precision is the fraction of relevant predictions among all predicted positive values:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (20)$$

The recall is the ratio of correctly predicted occurrences among all positive instances in the dataset:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (21)$$

An imbalanced dataset may distort the reported network accuracy because accuracy is biased toward the majority classes. Accordingly, the F1-measure was used to assess the detection performance of the proposed model:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (22)$$

By taking advantage of the combination of BLSTM and GRU for ciphertext classification, a single BLSTM layer was sufficient. Fig. 8 presents the results of the different evaluation metrics using the CNN and BLSTM-GRU network models on both datasets. The proposed method showed high classification accuracy for up to 15 different categories. Based on the experiments, the BLSTM-GRU network model works better than the CNN for ciphertext classification. The CNN performance on large datasets with long sequence lengths was much lower than that of the BLSTM-GRU network. This is because the Brown corpus dataset contains sentences with long sequence lengths, and the CNN cannot process long sequence data effectively, whereas the BLSTM-GRU network shows a high-level ability for long sequence processing.
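These metrics can be computed directly from the binary confusion-matrix counts, as in the small sketch below (the counts are illustrative, not results from the paper):

def metrics_from_confusion(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1 from confusion-matrix counts (eqs. 19-22).
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts, e.g. Vigenere-encrypted text vs. plaintext.
print(metrics_from_confusion(tp=95, tn=93, fp=7, fn=5))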

E. THE IMPACT OF HYPERPARAMETER TUNING
Hyperparameters, which are assigned by the user before training, have a strong impact on the training process. The impact of hyperparameter tuning on network training was investigated using the company report dataset encrypted with the Vigenere cipher method.

1) WORD EMBEDDING DIMENSION
Word embedding is a useful and popular tool in modern NLP, whose computational cost is usually a linear or quadratic function of the dimensionality. The word embedding dimension therefore has a profound impact on the training time and computational cost. A dimensionality that is too small cannot capture all possible word relationships, whereas a very large embedding dimensionality leads to network overfitting and slows down training. Fig. 9(a) presents the experimental results for the impact of different word embedding dimensions on the training process, and Table 7 lists the quantitative results.

2) MINIBATCH SIZE
The mini-batch stochastic gradient descent (SGD) is a widely used technique for large-scale optimization problems in training machine learning and deep learning models. The mini-batch size refers to the amount of data used in each iteration to train the network. An excessively large batch size slows the network convergence rate, while a too-small batch size makes the network fluctuate without achieving acceptable performance. Fig. 9(b) presents the experimental results for the impact of mini-batch size on the training process, and Table 8 lists the quantitative results.

3) IMPACT OF THE NUMBER OF LSTM HIDDEN UNITS
The number of hidden units in an LSTM refers to the dimensionality of its hidden states, and changing it affects the training of LSTMs. Fig. 9(c) shows the experimental results for the impact of the number of hidden units on the training process, and Table 9 lists the quantitative results.

F. DISCUSSION
This study introduced a deep learning-based network model for automated ciphertext classification with efficient performance using RNN-based cell units. Generally, an RNN-based model can store information along the sequence, which yields better performance than other deep learning-based models by taking the temporal and spatial features into account [31]-[33]. Bidirectional LSTM enables a better understanding of context by learning future time steps in a bidirectional manner. Moreover, GRU deploys reset and update gates in the hidden layer, which makes it computationally more efficient than a conventional LSTM. In this paper, a hybrid network model based on BLSTM-GRU cell units was proposed to recognize the ciphertext category automatically and accurately. In addition to features in the ciphertext document, this study focused on the temporal dependency of the input sequence using RNN-based cell units. Furthermore, to evaluate the efficiency of the proposed BLSTM-GRU network model against a CNN model, we proposed a 1D CNN-based network model for ciphertext classification. The efficiency of the proposed BLSTM-GRU method was compared with several other deep learning-based models, including the proposed 1D CNN model. The results confirm the efficacy of the proposed hybrid BLSTM-GRU network model across different well-known evaluation metrics, including the F1 score, precision, and recall. The disadvantages of RNN-based models include long-term dependence problems and gradient vanishing or explosion.
In this experiment, the Adam optimization method was used to train the model, and the learning rate was set to 0.001. The dropout method was used to prevent overfitting, with a factor of 0.2. In addition, this study investigated the impact of different hyperparameter settings on network performance, including the word embedding dimension, mini-batch size, and the number of BLSTM-GRU cell units.
The experimental results indicated that the network could converge faster using optimal hyperparameters. The effectiveness and performance of the proposed method were assessed by comparing the proposed method with some of the other deep learning-based models, as shown in Fig. 10.

V. CONCLUSIONS
This paper proposed a hybrid network model based on BLSTM and GRU cell units, which have recently outperformed many deep learning approaches in sequential data processing. The BLSTM showed a better understanding of the context by learning future time steps in a bidirectional manner. The GRU cell unit deploys update and reset gates, which makes it more efficient than the conventional LSTM model. Based on the experimental results, the hybrid method yielded high classification accuracy by deploying bidirectional learning, which enabled the extraction of more distinctive features to better predict the ciphertext classes. In principle, the proposed method can also classify ciphertext produced by modern ciphers, which are more complex and in which the relationships and dependencies are harder to discover distinctive features from for accurate ciphertext classification. The limitation of the proposed model can be expressed as long-term dependence problems in long ciphertext sequences, which cause the LSTM to lose important information along the sequence. Thus, future work will include an investigation of the network's ability to retain important information over longer ciphertext sequences.