Multilingual Program Code Classification Using $n$-Layered Bi-LSTM Model With Optimized Hyperparameters

Programmers are allowed to solve problems using multiple programming languages, resulting in the accumulation of a huge number of multilingual solution codes. Consequently, identifying codes from this vast archive of multilingual codes is a challenging and non-trivial task. Considering the codes' complexity compared to natural languages, conventional language models have had limited success. Deep neural network models have achieved state-of-the-art performance in programming-related tasks. However, the multilingual code classification based on the problem name or algorithm remains an open problem. This paper presents a novel multilingual program code classification model for the code classification task based on algorithms and problem names. First, a layered bidirectional long short-term memory model is designed to better understand the complex code context. Second, preprocessing, tokenization, and encoding processes are performed on real-life datasets. Next, clean and trainable formatted data are prepared. Finally, experiments are conducted on real-life datasets (e.g., sorting, searching, graphs and trees, numerical computations, basic data structures, and their combinations) with optimized hyperparameter settings. The results show that the proposed model can effectively improve the code classification accuracy compared to other baseline models.


I. INTRODUCTION
PROGRAMMING is one of the key techniques for developing modern information technology. Millions of codes are regularly generated in industrial and academic institutions [1], [2]. Programmers solve a single programming problem using different algorithms and programming languages and consider the instructions and constraints of the problem when writing the code. As a result, diverse and multilingual source codes are regularly accumulated and pushed out to the cloud repository [3]. The manual classification of the huge number of diverse multilingual source codes is a challenging and non-trivial task. Although many engineering models and approaches have been proposed to guide development, code writing is often expensive and error-prone [4]. Thus, it is important to develop a classification model that can better recognize the features and context of diverse codes to assist programmers. In particular, from a programming education perspective, the code classification model can provide students, teachers, and instructors with additional benefits in finding or recognizing relevant codes based on algorithms/problem names from large code repositories to accelerate programming learning. In software engineering (SE), in turn, the code classification model can help find appropriate modules to speed up software development. The model can be used at the functional level in the software development phases.

Manuscript received 2 June 2023; revised 3 October 2023; accepted 14 October 2023. Date of publication 12 December 2023; date of current version 27 March 2024. This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 23H03508. (Corresponding author: Md. Mostafizer Rahman.) Md. Mostafizer Rahman is with the Department of Computer and Information Systems, The University of Aizu, Fukushima 965-0006, Japan, and also with the Information and Communication Technology Cell, Dhaka University of Engineering & Technology, Gazipur 1707, Bangladesh (e-mail: mostafiz26@gmail.com). Yutaka Watanobe is with the Department of Computer and Information Systems, The University of Aizu, Fukushima 965-0006, Japan (e-mail: yutaka@u-aizu.ac.jp). Recommended for acceptance by W. Ding. Digital Object Identifier 10.1109/TETCI.2023.3336920
A considerable number of industrial and academic studies have addressed programming-related tasks to assist programmers and alleviate the various challenges they face, especially coding. Some examples are: automatic localization of errors in solution codes [5], [6], [7], [8], editing changes in solution codes [9], [10], [11], [12], code refactoring [13], [14], [15], mathematics-based formal methods and techniques for generating solution code according to the code specification [16], [17], [18], code completion [19], identification of errors (e.g., logical and syntactic) in solution codes [20], code evaluation and repair [21], and classification of codes based on errors, algorithms, languages, domains (e.g., network, game, word, and science), and code snippets [3], [22], [23], [24], [25], [26]. Classification methods are typically divided into two families: supervised learning (SL) and unsupervised learning (UL) [27]. In SL, the models are trained with known pairs of tuples (e.g., $\langle$ input, label $\rangle$). Let $\chi = \{\chi_1, \chi_2, \chi_3, \ldots, \chi_n\}$ be the set of solution codes and $\tau = \{\tau_1, \tau_2, \tau_3, \ldots, \tau_n\}$ be the set of corresponding labels (problem name/algorithm) of set $\chi$. Based on this definition, the output function of SL is written as $\tau = \Phi(\chi)$, where $\Phi$ is a mapping function and the output $\tau$ depends on the input $\chi$. In contrast, UL models are trained without known labels for the input data, and the data are processed by mathematical methods based on their similarity features [2]. In practice, collecting sufficient labeled data for a given task is often expensive and time-consuming. In order to obtain meaningful results with SL, sufficient and high-quality data are required for model training; otherwise, low-quality data may adversely affect its effectiveness. This problem is especially important in classification tasks with complex data and is often referred to as the "cold start problem" [27].
In recent years, deep neural network (DNN) models have achieved profound success in various tasks, including computer vision [28], image classification with a bidirectional long short-term memory (Bi-LSTM) network [29], [30], high-dimensional anomaly detection [31], [32], healthcare services using the data of Internet-of-Things systems [33], and online learning algorithms with a long short-term memory (LSTM) network [34]. Furthermore, many approaches have been proposed to better understand program code [35]. Watanobe et al. [36] proposed a source code classification model using convolutional neural networks; source codes in the C++ programming language were used in their experiments. Bi-LSTM and LSTM language models have also been employed for source code classification tasks [19], [20], [21]. Stochastic language models (SLMs) have achieved widespread success in natural language processing (NLP), speech recognition, language translation, and handwriting recognition [19]. The performance of SLMs (e.g., n-gram, bi-gram, skip-gram, and GloVe [37], [38]) heavily depends on a rich text corpus. However, considering the program code complexity, the different code structures across programming languages, and the code corpus limitations, SLMs have not achieved results as significant as those in NLP tasks.
As a remedy, recurrent neural networks (RNNs) have been introduced. RNNs have a built-in network memory that can store past information [39], but they cannot process long-term dependent information because the gradient exponentially increases or decreases during training. This problem is referred to as the "gradient vanishing and exploding" problem [40], [41]. It prevents the capturing of long-term dependent information and significantly degrades the RNN performance in real-world implementations [42]. To address the problem, the LSTM network has been introduced. It has a novel network architecture with four control gates (i.e., input, output, forget, and cell state) that can be used to overcome the "gradient vanishing and exploding" problem [42], [43]. Despite the good performance of the LSTM, it only processes information in one direction, from past to future; that is, the LSTM is unidirectional [44]. Resolving this problem required the introduction of the Bi-LSTM network [45], which processes information in both the forward and backward directions. In Bi-LSTM, two independent hidden layers (i.e., forward and backward) are connected to the same input, and the results of these two layers are concatenated for the output. In practice, Bi-LSTM models have shown much better performance compared to LSTM [44].
In this paper, we propose an n-layered Bi-LSTM model for program code classification tasks. In the n-layered Bi-LSTM model, n Bi-LSTM layers are used, where the output of the hidden state of each Bi-LSTM layer is given as the input to the next Bi-LSTM layer. The "weight update" formula is similar to that of the original Bi-LSTM. The deep layered architecture of the n-layered Bi-LSTM model allows the extraction of more complex features based on the previous layers. This layered Bi-LSTM mechanism captures the complex context and features of the data. Solution codes contain functions, classes, keywords, tokens, characters, numbers, operators, and variables with long- and short-term dependencies; hence, the n-layered Bi-LSTM model structure can accurately capture the dependencies and the complex context of solution codes.
We validate the performance of the proposed n-layered Bi-LSTM model by creating several datasets (i.e., sorting, searching, graphs and trees, numerical computations, basic data structures, and their combinations) with real-life solution codes collected from an online judge system. The datasets are composed of multilingual solution codes (i.e., approximately 15 programming languages) implementing various algorithms, which ensures the high diversity of the datasets. The experimental results suggest that the performance of the state-of-the-art models (i.e., LSTM and Bi-LSTM) on our dataset is significantly inferior to that of the n-layered Bi-LSTM model. Moreover, the network hyperparameters are fine-tuned during the model experiments. The evaluation results indicate that none of the compared state-of-the-art models outperform the n-layered Bi-LSTM model.
The main contributions of this study are as follows:
- We propose a novel n-layered Bi-LSTM model for the code classification task. We introduce herein a new input supply from one Bi-LSTM layer to another. The deep structure of the n-layered Bi-LSTM model can understand the complex context and features of the solution codes.
- The experimental results show a substantial improvement of the classification task on highly diverse solution codes compared to other state-of-the-art models (e.g., LSTM and Bi-LSTM).
- We create datasets using real-life solution codes that can be useful for other programming-related studies such as code generation, refactoring, code translation, and error detection.

The remainder of this paper is organized as follows: Section II describes the background and the theoretical foundations of the NLP and DNN models closely related to this study; Section III presents the problem statement and the motivation of our study; Section IV explains the details of our proposed program code classification approach; Section V provides the datasets, evaluation metrics, implementation details, results, and discussion; and Section VI concludes this study with a note on future work.

II. BACKGROUND AND THEORETICAL FOUNDATION
This section provides a brief introduction to the background and theoretical foundations of this study. Hence, the mathematical representations of the SLM, RNN, LSTM, and Bi-LSTM models for sequential language modeling tasks are presented.

A. n-Gram Language Model
The n-gram model is a popular language model in NLP tasks. It predicts the next word based on the word sequence probability. Let $a = \{a_1, a_2, a_3, \ldots, a_n\}$ be the set of words of sequence $a$, where $a_i$ is a single word. The probability of the entire word sequence, $\psi(a)$, is calculated by the chain rule of probability:

$\psi(a) = \psi(a_1)\,\psi(a_2 \mid a_1)\,\psi(a_3 \mid a_1^{2})\cdots\psi(a_n \mid a_1^{n-1}) = \prod_{i=1}^{n} \psi(a_i \mid a_1^{i-1}).$
If the probability of a word $\psi(a_i)$ is approximated using only a fixed number of preceding words instead of the full history $\psi(a_i \mid a_1^{i-1})$, it can be described through the Markov assumption.
The n-gram model uses (3) as the conditional probability of the next word in a sequence, where $N$ indicates the n-gram size, e.g., $N = 2$ for the bi-gram, $N = 3$ for the tri-gram, and so on:

$\psi(a_i \mid a_1^{i-1}) \approx \psi(a_i \mid a_{i-N+1}^{i-1}). \quad (3)$
In practice, smoothing techniques are combined with the maximum-likelihood estimate in (4), where $\phi(\cdot)$ computes the count of word sequences in the corpus:

$\psi(a_i \mid a_{i-N+1}^{i-1}) = \frac{\phi(a_{i-N+1}^{i-1}\, a_i)}{\phi(a_{i-N+1}^{i-1})}. \quad (4)$
The language model's performance can be evaluated using the cross-entropy (C e ) calculation, where a lower C e value indicates a better language model, and vice versa [46].
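The counting-based estimate above can be illustrated with a minimal bigram ($N = 2$) model; the toy corpus and function names below are hypothetical, and no smoothing is applied:

```python
from collections import Counter

# Toy corpus of token sequences (hypothetical).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

# phi(.) counts bigrams and their unigram prefixes.
bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    for i in range(len(sent) - 1):
        bigrams[(sent[i], sent[i + 1])] += 1
        unigrams[sent[i]] += 1

def p_next(prev, word):
    """Maximum-likelihood estimate psi(word | prev) = phi(prev, word) / phi(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# "the" is followed by "cat" in 2 of its 3 occurrences, so P("cat" | "the") = 2/3.
assert abs(p_next("the", "cat") - 2 / 3) < 1e-9
```

In practice a smoothing term would be added to the numerator and denominator so that unseen bigrams do not receive zero probability.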

B. Recurrent Neural Networks
An RNN is a neural network that can process dependent sequential data $x(t) = x(1), x(2), x(3), \ldots, x(T-1), x(T)$ by using its ingrained network "memory". This "memory" captures all previously calculated information. The basic RNN structure is described by the following set of equations [34]:

$h_t = d(U x_t + G h_{t-1})$
$\hat{y}_t = e(V h_t)$

where $x_t \in \mathbb{R}^j$ is the input vector, $h_t \in \mathbb{R}^k$ is the hidden state vector, and $\hat{y}_t \in \mathbb{R}^k$ is the output vector. Functions $d(\cdot)$ and $e(\cdot)$ are used for the non-linearity and the output of the network; the widely used functions are $\tanh(\cdot)$ and $\mathrm{softmax}(\cdot)$. The coefficient weight matrices are $U \in \mathbb{R}^{j \times k}$, $G \in \mathbb{R}^{k \times k}$, and $V \in \mathbb{R}^{k \times k}$.

1) Gradient Vanishing and Exploding: Two processes are executed in a single time step: (i) the forward pass and (ii) the backward pass. In the forward pass, the loss function ($\wp_t$), hidden layer state ($h_t$), and output ($\hat{y}_t$) are computed. The loss between the estimated ($\hat{y}_t$) and true ($y_t$) labels is calculated using $\wp_t$. The total loss is described as $L = \sum_i \wp_i(\hat{y}_t, y_t)$. In the backward pass, the loss function gradient for each weight matrix ($\delta L/\delta U$, $\delta L/\delta G$, $\delta L/\delta V$) is computed to update the weight matrices $U$, $G$, and $V$ using the backpropagation algorithm (i.e., backpropagation through time). The gradients are backpropagated through time and layers, and all past contributions are summed up to the current contribution:

$\frac{\delta L}{\delta G} = \sum_{m=1}^{t} \frac{\delta L}{\delta \hat{y}_t}\, \frac{\delta \hat{y}_t}{\delta h_t} \left( \prod_{i=m+1}^{t} \frac{\delta h_i}{\delta h_{i-1}} \right) \frac{\delta h_m}{\delta G} \quad (8)$

where the contribution of a state (at time step $m$) to the gradient of the total loss $L$ is calculated. Equation (8) has two erratic cases during backpropagation: (i) if $\left\| \frac{\delta h_i}{\delta h_{i-1}} \right\|_2 < 1$, the gradient vanishes or disappears; and (ii) if $\left\| \frac{\delta h_i}{\delta h_{i-1}} \right\|_2 > 1$, the gradient explodes.
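A single RNN forward step as written above, $h_t = d(U x_t + G h_{t-1})$ with $d = \tanh$, can be sketched in pure Python; the matrix sizes, zero recurrent weights, and omitted biases are illustrative simplifications:

```python
import math

def rnn_step(x_t, h_prev, U, G):
    """One RNN step: h_t = tanh(U x_t + G h_{t-1}) (biases omitted for brevity)."""
    k = len(h_prev)
    h_t = []
    for r in range(k):
        s = sum(U[r][j] * x_t[j] for j in range(len(x_t)))
        s += sum(G[r][j] * h_prev[j] for j in range(k))
        h_t.append(math.tanh(s))
    return h_t

# Tiny example: input dim j = 2, hidden dim k = 2, zero recurrent weights,
# so each hidden unit simply squashes its input through tanh.
U = [[1.0, 0.0], [0.0, 1.0]]
G = [[0.0, 0.0], [0.0, 0.0]]
h = rnn_step([0.5, -0.5], [0.0, 0.0], U, G)
assert abs(h[0] - math.tanh(0.5)) < 1e-9
```

Unrolling this step over many time steps is exactly what produces the repeated $\delta h_i / \delta h_{i-1}$ factors in (8) that vanish or explode.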

C. Long Short-Term Memory Neural Networks
LSTM networks are specialized RNNs used in various complex problem domains, such as speech recognition and machine translation. LSTMs can remember long dependent input sequences and overcome the gradient vanishing and exploding problems [42]. The LSTM architecture is described by the following set of equations [42]:

$f_t = \sigma(U_f x_t + G_f h_{t-1} + b_f)$
$i_t = \sigma(U_i x_t + G_i h_{t-1} + b_i)$
$o_t = \sigma(U_o x_t + G_o h_{t-1} + b_o)$
$\tilde{c}_t = d(U_c x_t + G_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot d(c_t)$

where $c_t \in \mathbb{R}^k$ is the state vector, $x_t \in \mathbb{R}^j$ is the input vector, and $h_t \in \mathbb{R}^k$ is the output vector. Here, $c_t$, $f_t$, $i_t$, and $o_t$ are the cell state, forget, input, and output gates, respectively. $\tilde{c}_t \in \mathbb{R}^k$ is the candidate state obtained by the nonlinear function. Function $d(\cdot)$ stands for the $\tanh(\cdot)$ function and is applied pointwise to the input vectors. Similarly, function $\sigma(\cdot)$ stands for the sigmoid function and is applied pointwise to the vector elements. The coefficient weight matrices $U_{\{f,i,o,c\}}$, $G_{\{f,i,o,c\}}$ and the bias vectors $b_{\{f,i,o,c\}}$ are learned during training.
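The gate equations above can be sketched as one pure-Python cell update; using scalar weights instead of matrices is an illustrative simplification, and the weight values are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step with forget (f), input (i), and output (o) gates
    and candidate state c_tilde, following the standard update rules."""
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])
    c_tilde = math.tanh(w["c_x"] * x + w["c_h"] * h_prev + w["c_b"])
    c = f * c_prev + i * c_tilde      # cell state mixes old memory and new candidate
    h = o * math.tanh(c)              # output gate modulates the exposed state
    return h, c

# Arbitrary weights; the gates keep h and c bounded in (-1, 1).
w = {k: 0.5 for k in ["f_x", "f_h", "f_b", "i_x", "i_h", "i_b",
                      "o_x", "o_h", "o_b", "c_x", "c_h", "c_b"]}
h, c = lstm_step(1.0, 0.0, 0.0, w)
assert -1.0 < h < 1.0 and -1.0 < c < 1.0
```

The additive form of the cell-state update (`f * c_prev + i * c_tilde`) is what lets gradients flow over long spans without the repeated multiplication that causes vanishing in a plain RNN.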

D. Bidirectional Long Short-Term Memory Neural Networks
Bi-LSTM extends the unidirectional LSTM with a new hidden layer that passes information in the backward direction [45]. Thus, the forward hidden layer ($\overrightarrow{H}$) starts with the first token of the sequence, while the backward hidden layer ($\overleftarrow{H}$) starts with the last token. In other words, the connection flow from one hidden layer to the other is in reverse temporal order.
The internal structure of the Bi-LSTM model is useful for understanding the complex context of dependent information, such as time series and language modeling tasks. Fig. 1 presents the basic architecture of the Bi-LSTM. The Bi-LSTM model architecture [45] is described by the following set of equations:

$\overrightarrow{h}_t = d(\overrightarrow{U} x_t + \overrightarrow{G}\, \overrightarrow{h}_{t-1} + \overrightarrow{b})$
$\overleftarrow{h}_t = d(\overleftarrow{U} x_t + \overleftarrow{G}\, \overleftarrow{h}_{t+1} + \overleftarrow{b})$

where the coefficient weight matrices $\overrightarrow{U}, \overleftarrow{U}$ and $\overrightarrow{G}, \overleftarrow{G}$ and the bias vectors $\overrightarrow{b}, \overleftarrow{b}$ belong to the forward and backward layers, respectively. $d(\cdot)$ is an activation function usually set to the hyperbolic tangent function (i.e., $\tanh(\cdot)$) and applied to the input vectors pointwise.

Here, the two hidden states ($\overrightarrow{h}_t$ and $\overleftarrow{h}_t$) are concatenated as $h_t$ and fed to the output layer:

$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$

where $h_t \in \mathbb{R}^{2k}$ is the concatenated hidden state. Finally, the output $o_t \in \mathbb{R}^{q}$ ($q$: number of outputs) is computed in the output layer as follows:

$o_t = e(V h_t + b_o)$

where $V \in \mathbb{R}^{2k \times q}$ is the coefficient weight matrix and $b_o \in \mathbb{R}^{q}$ is the bias vector; $e(\cdot)$ is the activation function of the output layer, which is typically set to the softmax function (i.e., $\mathrm{softmax}(\cdot)$).
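The concatenation of the two directional hidden states can be sketched as follows; `run_forward` and `run_backward` are hypothetical stand-ins for full LSTM passes and simply accumulate sums here:

```python
def run_forward(seq):
    """Stand-in for the forward LSTM pass: one 'state' per time step."""
    out, s = [], 0.0
    for x in seq:
        s += x
        out.append(s)
    return out

def run_backward(seq):
    """Stand-in for the backward LSTM pass: same accumulation over the reversed
    sequence, then re-aligned to the original time order."""
    return list(reversed(run_forward(list(reversed(seq)))))

def bilstm_states(seq):
    """h_t = [forward_h_t ; backward_h_t] for each time step t."""
    fwd, bwd = run_forward(seq), run_backward(seq)
    return [(f, b) for f, b in zip(fwd, bwd)]

states = bilstm_states([1.0, 2.0, 3.0])
# At t = 0 the forward state has seen only x_0, while the backward state
# has already seen the whole sequence -- this is the source of future context.
assert states[0] == (1.0, 6.0)
```

The pair at each step plays the role of $h_t \in \mathbb{R}^{2k}$: even at the first token, the backward half carries information from the end of the sequence.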

III. PROBLEM STATEMENT AND MOTIVATION
In recent years, DNN models have achieved great success in NLP tasks [47], [48] due to the richness of the natural language corpus. The structure of natural languages is uniform, which helps in the collection of a large corpus. In contrast, programming problems can be solved using many programming languages. Programmers write codes in their own style, and there are no predefined ways to solve programming problems. Consequently, the heterogeneity of code structures and programming languages and the complexity of codes are not comparable to natural languages. Fig. 2 shows an example of a problem (selection sort) solved with three different programming languages (i.e., Java, C++, and Python). The variables, operators, methods, classes, keywords, input/output functions, header functions, and code composition of these solutions are completely different, making programming codes even more complex and diverse. For example, the heterogeneity/complexity of a code can be characterized by features such as the number of variables it contains. Therefore, understanding these huge, complex, and diverse codes with DNN models is a difficult and non-trivial task compared to natural languages. We are motivated to address the problem of "how the DNN model can better understand the diverse and complex codes" by developing a novel DNN model with a very deep structure. The deep structure helps to gain an in-depth understanding of code features through repeated learning of the input data (or codes).

IV. PROPOSED APPROACH
The proposed code classification approach consists of two main phases, namely code preprocessing and classification with the n-layered Bi-LSTM model. Algorithm 1 describes the overall code classification process as pseudocode. The overview of the proposed method and the architecture of the n-layered Bi-LSTM model are described below.

A. Overview of the Classification Model
Fig. 3 presents an overview of the proposed model, where the left part shows the code preprocessing, and the right part is used for the model training and classification processes. The code preprocessing aims to remove irrelevant information from the original solution codes. It is considered one of the most vital and primary tasks for solution code classification. We obtained the ideal format of the solution codes by following the code preprocessing steps of previous studies [19], [20], [44]. To this end, we removed all irrelevant information (e.g., comments, tabs, spaces, and line breaks) from the solution codes for the model training and evaluation. Each solution code was then converted into a sequence of token words, and each token word was assigned a unique token number or integer index. Let $W = \{w_1, w_2, w_3, \ldots, w_v\}$ be the set of word sequences of a solution code and $O = \{o_1, o_2, o_3, \ldots, o_v\}$ be the corresponding mapping numbers. To train the model, the lengths of the code sequences were kept the same by padding or truncation.
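The preprocessing steps above (tokenize, map tokens to integer indices, pad or truncate to a fixed length) can be sketched in pure Python; the whitespace tokenizer and index scheme below are simplified assumptions, not the paper's exact pipeline:

```python
def build_vocab(codes):
    """Assign each token word a unique integer index (0 is reserved for padding)."""
    vocab = {}
    for code in codes:
        for tok in code.split():
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def encode(code, vocab, max_len):
    """Map a code to its integer indices, then pad with 0 or truncate to max_len."""
    seq = [vocab.get(tok, 0) for tok in code.split()]
    return (seq + [0] * max_len)[:max_len]

codes = ["int main ( )", "for i in range"]
vocab = build_vocab(codes)
# "int main ( )" -> [1, 2, 3, 4], padded with zeros to length 6.
assert encode("int main ( )", vocab, 6) == [1, 2, 3, 4, 0, 0]
```

A real pipeline would tokenize at the lexical level of each programming language rather than on whitespace, but the index-and-pad mechanics are the same.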
Meanwhile, the classification process starts with the embedding layer, which converts each integer index into a real-valued feature vector. These real-valued feature vectors of the code tokens are combined to form a matrix called the embedding matrix [49]. Each row of the embedding representation corresponds to a word of the code sequence. Note that the embedding representation in the matrix is identical for the same words. The embedding matrix is described by (19):

$E \in \mathbb{R}^{v \times d} \quad (19)$

where $E$ is the embedding matrix, $v$ is the vocabulary size of the solution codes, and $d$ is the embedding dimension of the dense vector. In this paper, the dimension $v \times d$ (i.e., $10{,}000 \times 200$) of the embedding matrix $E$ is used as the pretrained word embedding vector. Next, the vectorized code information (i.e., $\langle$ solution code, target label $\rangle$) is propagated to the DNN layers (i.e., the n-layered Bi-LSTM and the dense/fully connected layer) for feature learning of the solution codes. Finally, the output layer is used for the solution code classification.
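The embedding lookup, i.e., selecting row $i$ of $E \in \mathbb{R}^{v \times d}$ for token index $i$, can be sketched with toy sizes (the paper uses $v = 10{,}000$ and $d = 200$; the random initialization here is only illustrative):

```python
import random

random.seed(0)
v, d = 5, 3                     # vocabulary size and embedding dimension (toy values)
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(v)]

def embed(token_indices):
    """Look up one row of E per token; the result is the embedding representation
    of the code sequence, one d-dimensional vector per token."""
    return [E[i] for i in token_indices]

seq = [2, 0, 2]                 # identical tokens map to identical rows of E
emb = embed(seq)
assert emb[0] == emb[2] and len(emb[0]) == d
```

In training, the rows of $E$ are updated by backpropagation so that tokens with similar roles in solution codes end up with nearby vectors.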

B. Architecture of the N-Layered Bi-LSTM Neural Network
In this section, we present our proposed n-layered Bi-LSTM model architecture. A multilayer or stacked architecture of Bi-LSTM neural networks further improves the classification or regression performance [50], [51]. Moreover, existing works [52], [53], [54] have shown that deep hierarchical architectures with several hidden layers are more efficient than shallow ones because they can build higher-level representations of dependent sequential data. The layered Bi-LSTM architecture can provide a higher prediction performance by obtaining richer contextual information of the solution codes from both past and future sequences. The deep architectures consist of multiple hidden layers, where the output of one hidden layer is used as the input to the subsequent hidden layer. This layered mechanism enhances the neural network performance, and we adopt this concept in this work. Fig. 4 illustrates the graphical architecture of the proposed n-layered Bi-LSTM for the code classification task.
In this scheme, the solution codes are first preprocessed in the input layer, where the token sequences of all codes are brought to the same length. Next, an embedding matrix ($E \in \mathbb{R}^{v \times d}$) is created in the embedding layer based on the code sequences and used as the input to the deep n-layered Bi-LSTM. In this architecture, for a particular time step $t$, the input is given to the hidden layers in the forward direction to learn information from the past. The same input data are fed to the hidden layers in the reverse direction to learn future information. The output $y_t$ is obtained by concatenating the outputs of both the forward and reverse layers. The hidden units of the upper (next) layers take the output of the lower (previous) layers to determine the detailed contextual information from the solution codes. For example, $y_t^{(1)}$ is the output of the 1-layered Bi-LSTM, used as the input of the 2-layered Bi-LSTM. Algorithm 2 describes the complete n-layered Bi-LSTM algorithm as pseudocode. The n-layered Bi-LSTM is connected to the dense or fully connected layer to further process the contextual representation of the solution codes. Finally, the output layer classifies the class label of the solution code using softmax.
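The layering rule, where the output sequence of layer $l$ becomes the input sequence of layer $l+1$, can be sketched independently of the cell internals; `bilstm_layer` is a hypothetical stand-in for one full Bi-LSTM layer:

```python
def bilstm_layer(seq):
    """Stand-in for one Bi-LSTM layer: returns one output per time step.
    A real layer would apply the gated forward/backward recurrences."""
    return [x * 0.5 for x in seq]   # placeholder transformation

def n_layered(seq, n):
    """Feed the output of each layer as the input of the next, n times."""
    out = seq
    for _ in range(n):
        out = bilstm_layer(out)
    return out

# Three stacked layers compose the placeholder transformation three times.
y = n_layered([8.0, 4.0], n=3)
assert y == [1.0, 0.5]
```

The point of the sketch is only the composition: the sequence shape is preserved layer to layer, so each Bi-LSTM layer can re-read the whole (already contextualized) sequence produced by the layer below.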
Remark 1: In the n-layered Bi-LSTM architecture, the output of the lower layer becomes the input of the next layer.
For a given time step $t$, the output of the first layer of the n-layered Bi-LSTM architecture is defined as follows. The forward layer of the 1-layered Bi-LSTM computes the forward hidden state $\overrightarrow{h}_t^{(1)}$ from the input $x_t$ using the LSTM gate equations of Section II-C, and the backward layer computes $\overleftarrow{h}_t^{(1)}$ from the same input processed in reverse temporal order, each with layer-specific weight matrices and bias vectors (e.g., $b_c^{(1_f)} \in \mathbb{R}^k$ for the forward candidate state). The two hidden states are concatenated and mapped to the layer output:

$h_t^{(1)} = [\overrightarrow{h}_t^{(1)} ; \overleftarrow{h}_t^{(1)}]$
$y_t^{(1)} = e(V_y^{(1)} h_t^{(1)} + b_y^{(1)})$

where $V_y^{(1)} \in \mathbb{R}^{2k \times q}$ is the coefficient weight matrix and $b_y^{(1)} \in \mathbb{R}^{q}$ is the bias vector. The output of the lower layer is fed to the next subsequent layer (Fig. 4). Thus, the output $y_t^{(1)}$ of the 1-layered Bi-LSTM is fed to the second layer (i.e., the 2-layered Bi-LSTM), whose input is $y_t^{(1)}$ and whose output is

$y_t^{(2)} = e(V_y^{(2)} h_t^{(2)} + b_y^{(2)})$

where $V_y^{(2)} \in \mathbb{R}^{2k \times q}$ is the coefficient weight matrix and $b_y^{(2)} \in \mathbb{R}^{q}$ is the bias vector. Similarly, the n-layered Bi-LSTM, whose input is the output $y_t^{(n-1)}$ of the $(n-1)$-layered Bi-LSTM, is described as

$y_t^{(n)} = e(V_y^{(n)} h_t^{(n)} + b_y^{(n)})$

where $V_y^{(n)} \in \mathbb{R}^{2k \times q}$ is the coefficient weight matrix and $b_y^{(n)} \in \mathbb{R}^{q}$ is the bias vector. Finally, a softmax layer is added on top of the output layer to model the multi-class probabilities:

$\mathrm{softmax}(Y)_i = \frac{e^{Y_i}}{\sum_{j=1}^{C} e^{Y_j}}$

where $C$ is the number of classes, the numerator $e^{Y_i}$ is the exponential function applied to each element of $Y$, and the denominator $\sum_{j=1}^{C} e^{Y_j}$ is the sum of the exponential functions of all elements.
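The softmax at the output layer, which turns the final logits into class probabilities, can be sketched numerically (the max-shift is a standard numerical-stability trick, not part of the paper's formula):

```python
import math

def softmax(Y):
    """softmax(Y)_i = exp(Y_i) / sum_j exp(Y_j); shifting by max(Y) avoids overflow
    without changing the result."""
    m = max(Y)
    exps = [math.exp(y - m) for y in Y]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9       # a valid probability distribution
assert probs.index(max(probs)) == 0       # the largest logit gets the largest probability
```

The predicted class label of a solution code is then simply the index of the largest probability.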

V. EXPERIMENTAL RESULTS
We conducted a series of experiments on seven different datasets, including combined ones, to evaluate the code classification performance of our proposed model. The datasets, evaluation metrics, implementation details, and results are elaborated in the subsequent sections.

A. Datasets
The datasets used in our experiments are extracted from the Aizu Online Judge (AOJ) system [55], [56]. The growing resources of the AOJ system have been used by various ML/AI-based projects in recent years. For example, IBM and Google DeepMind have used the solution codes of AOJ in their CodeNet [57] and AlphaCode [58] projects, respectively. Fig. 5 provides an overview of the dataset creation using the AOJ system. First, we extract the solution codes from the AOJ based on different algorithms and problem names. Second, duplicate solution codes and irrelevant elements are removed from the codes. Third, a cross-validation of the solution codes is performed to ensure that the codes are in order after removing the irrelevant elements. Finally, the datasets are created.
We use seven different datasets in the experiments, namely Sorting, Searching, Graph & Tree (G&T), Numerical Computation (NC), Basic Data Structures (BDS), and their combinations. The combined datasets are used in the experiments to increase the dataset diversity and complexity. We assign a ground-truth value or label to the solution codes based on the algorithm used or the problem name. For example, the solution codes of Counting Sort are labeled as Counting Sort. The Sorting dataset contains solution codes for various sorting algorithms (e.g., Bubble Sort, Insertion Sort, Selection Sort, Merge Sort, Counting Sort, Shell Sort, and Quick Sort). Similarly, the Searching, G&T, BDS, and NC datasets contain the solution codes of various algorithms and problems in each category. Table I presents the statistics for these datasets.

B. Evaluation Metrics
In classification settings, the classifier performance is usually defined by the confusion matrix. The accuracy, recall (sensitivity), precision, and F1-score (combining recall and precision) are calculated from the confusion matrix entries. We follow the standard evaluation approach [19], [49], [59] for our code classification task. First, the multi-class classification accuracy is defined as the average number of correct predictions relative to the total number of predictions:

$A = \frac{1}{n} \sum_{i=1}^{n} Q(f(x_i) = y_i)$

where $Q$ is the function that returns 1 if the class is true and 0 otherwise, $C$ is the number of classes, and $f(x) \in C = \{1, 2, 3, \ldots, n\}$.
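The accuracy definition above, the average of the indicator $Q(f(x_i) = y_i)$ over all samples, can be sketched as follows (the label vectors are hypothetical):

```python
def accuracy(predicted, true):
    """Q(f(x) = y) contributes 1 for a correct class and 0 otherwise;
    accuracy is the average over all n predictions."""
    correct = sum(1 for p, t in zip(predicted, true) if p == t)
    return correct / len(true)

# Three of four hypothetical predictions match the true labels.
assert accuracy([1, 2, 2, 3], [1, 2, 3, 3]) == 0.75
```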
The program code datasets may contain various data imbalances. Advanced evaluation metrics are adopted to obtain an unbiased performance evaluation for the imbalanced datasets.

TABLE I STATISTICAL OVERVIEW OF THE DATASETS AND THEIR DISTRIBUTIONS FOR MODEL TRAINING AND VALIDATION
In addition to the accuracy, precision, recall, and F1-score, we also perform calculations in the macro, micro, and weighted settings. The micro-average is calculated across all samples, summing all true positives (TP), false negatives (FN), and false positives (FP). Thus, the micro-precision ($P_\mu$), recall ($R_\mu$), and F1-score ($F1_\mu$) are calculated by (43)-(45):

$P_\mu = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FP_k}, \quad R_\mu = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FN_k}, \quad F1_\mu = \frac{2 P_\mu R_\mu}{P_\mu + R_\mu}$

A higher $F1_\mu$ score indicates a better overall performance of the classification model. However, it is not sensitive to individual classes, so an imbalanced class data distribution can misrepresent the overall performance of the classification model. The macro-average, by contrast, takes into account the performance of individual classes; a higher macro F1-score means a better model performance for the individual classes. If the class data distribution is imbalanced, the macro-average is more appropriate than the micro-average. The macro-precision ($P_\rho$), recall ($R_\rho$), and F1-score ($F1_\rho$) are calculated by (46)-(48):

$P_\rho = \frac{1}{C} \sum_{k=1}^{C} P_k, \quad R_\rho = \frac{1}{C} \sum_{k=1}^{C} R_k, \quad F1_\rho = \frac{1}{C} \sum_{k=1}^{C} F1_k$

The weighted-average F1-score, on the other hand, is computed from the mean of all F1-scores of the individual classes considering the support of each class. "Support" refers to the number of class instances, and "weight" refers to the ratio of instances of each class to the sum of all instances. The weighted-precision ($P_\omega$), recall ($R_\omega$), and F1-score ($F1_\omega$) are calculated by (49)-(51):

$P_\omega = \sum_{k=1}^{C} \frac{|s_k|}{|S|} P_k, \quad R_\omega = \sum_{k=1}^{C} \frac{|s_k|}{|S|} R_k, \quad F1_\omega = \sum_{k=1}^{C} \frac{|s_k|}{|S|} F1_k$

where $|s_k|$ is the support of class $k$ and $|S|$ is the sum of all supports. The Cohen Kappa ($\kappa$) score [59] is used to evaluate the classification model's performance and is calculated as follows:

$\kappa = \frac{\phi_o - \phi_e}{1 - \phi_e}$

where $\phi_o$ is the observed probability indicating the classification model accuracy (or perfect match), and $\phi_e$ represents the probability that the model prediction and the actual class value match by chance. The Area Under the Receiver Operating Characteristic Curve (ROC AUC) [60] is employed to evaluate our multi-class classification model. In this context, two strategies are considered for the classification model evaluation: "one-vs-one (OvO)" and "one-vs-rest (OvR)". Here, OvO calculates the average of the pairwise ROC AUC scores, and OvR computes the average score for each class against all other classes. These scores can appear overly optimistic when there is a strong class imbalance, that is, when the number of minority class instances is small.
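The difference between the micro- and macro-averages discussed above can be sketched from per-class TP/FP counts; the counts for the two classes below are hypothetical and deliberately imbalanced:

```python
# Hypothetical per-class counts for an imbalanced two-class problem:
# class A dominates, class B has a single (correctly predicted) instance.
tp = {"A": 90, "B": 1}
fp = {"A": 10, "B": 0}

def micro_precision(tp, fp):
    """Sum TP and FP over all classes first, then divide (sample-weighted)."""
    return sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

def macro_precision(tp, fp):
    """Average the per-class precisions, giving every class equal weight."""
    per_class = [tp[c] / (tp[c] + fp[c]) for c in tp]
    return sum(per_class) / len(per_class)

# Micro is dominated by the majority class A; macro weights B equally.
assert abs(micro_precision(tp, fp) - 91 / 101) < 1e-9
assert abs(macro_precision(tp, fp) - (0.9 + 1.0) / 2) < 1e-9
```

The same pattern holds for recall and F1: the micro scores track the majority class, while the macro scores surface per-class behavior, which is why the paper reports both on its imbalanced datasets.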

C. Implementation Details
We tune different hyperparameters in the proposed n-layered Bi-LSTM model to achieve better results. A program code is usually a collection of complex, interrelated instructions, including mathematical operations, methods, functions, variables, keywords, and tokens. The selection of optimal hyperparameters is therefore greatly important for understanding complex code interrelationships. We employ three ($n = 3$) Bi-LSTM layers as part of the DNN architecture. Adam [61] is applied as the network optimization method. Different values are used for the batch size $\beta$ (i.e., $\beta = \{16, 32, 64\}$) and the learning rate $\eta$ (i.e., $\eta = \{0.001, 0.005, 0.01\}$). The activation functions in the dense layer are sigmoid ($\sigma(z) = \frac{1}{1+e^{-z}}$), tanh, and relu ($\Lambda(z) = \max(0, z)$). We also conduct experiments with and without a dropout layer, where the dropout layer is placed before the dense layer; the dropout ($\xi$) values are 0.1 and 0.3. The sparse categorical cross-entropy is used as the loss function ($L$) of the classification model, described in (53):

$L(w) = -\sum_{c=1}^{C} y_c \log \hat{y}_c \quad (53)$

where $w$ is the model parameter, $C$ is the number of classes, and $y_c$ and $\hat{y}_c$ are the true and predicted labels, respectively.
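The loss in (53) reduces, for an integer ("sparse") true label, to the negative log-probability the model assigns to that label. A minimal sketch with hypothetical probability vectors:

```python
import math

def sparse_categorical_cross_entropy(true_class, predicted_probs):
    """-log of the probability the model assigns to the true (integer) class label,
    i.e. (53) with y_c one-hot at true_class."""
    return -math.log(predicted_probs[true_class])

# The loss shrinks as the predicted probability of the true class grows:
# a confident correct prediction (0.8) is penalized far less than a weak one (0.2).
assert sparse_categorical_cross_entropy(1, [0.1, 0.8, 0.1]) < \
       sparse_categorical_cross_entropy(1, [0.4, 0.2, 0.4])
```

The "sparse" variant avoids materializing one-hot vectors, which matters when the number of class labels is large.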

D. Results
Tables III-VI present the quantitative classification results (%) of the macro-precision (Pρ) and recall (Rρ) and the weighted-precision (Pω) and recall (Rω) of the proposed n-layered Bi-LSTM and other state-of-the-art models over the seven datasets. The hyperparameter settings (i.e., learning rate η = 0.001, epoch = 50, batch sizes β = {16, 32, 64}, activation functions Υ = {tanh, sigmoid, relu}, and dropout ξ = {none}) are applied in these experiments. The LSTM model fails to match the Bi-LSTM and the 2- and 3-layered Bi-LSTM models; it achieves particularly low Pρ, Rρ, Pω, and Rω scores on the NC, G&T, and Sorting + Searching + G&T datasets, and it also fails to produce good results with different hyperparameters on the Searching dataset. In contrast, the 2-layered Bi-LSTM model achieves better Pρ, Rρ, Pω, and Rω scores than the LSTM and Bi-LSTM models. The Bi-LSTM solves these problems to some extent and significantly improves the classification results; however, its performance is still worse than that of the layered Bi-LSTM. The LSTM and Bi-LSTM models struggle to match the layered Bi-LSTM models on the G&T, Sorting + Searching, and Sorting + Searching + G&T datasets. The deep layered architecture enables the layered Bi-LSTM model to better learn the complex context of large, diverse codes. Considering the same hyperparameter settings, the A, κ, and OvR AUC ROC scores for all models over the seven datasets are also calculated (Table VII). The results show that the 2-layered Bi-LSTM model outperforms the other models with an average A of 97.00 ± 0.10, κ of 96.29, and OvR of 99.68. Although the 3-layered Bi-LSTM model achieves relatively better results (A of 96.30 ± 0.07) than the LSTM (A of 90.19 ± 0.50), it is not as good as the 2-layered Bi-LSTM. These results show that the 2-layered Bi-LSTM model is more suitable for the current datasets. Thus, we exclude Bi-LSTM models with three or more layers from further experiments.
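The stacked architecture can be sketched with a minimal NumPy forward pass. This is an illustrative sketch, not the trained model: the initialization, hidden size, and helper names are assumptions, and it only shows the structural idea that each layer concatenates a forward and a backward LSTM pass and feeds the 2k-dimensional result to the next layer.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4k, d_in), U: (4k, k), b: (4k,)."""
    k = h.shape[0]
    z = W @ x + U @ h + b
    i = 1 / (1 + np.exp(-z[:k]))        # input gate
    f = 1 / (1 + np.exp(-z[k:2*k]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*k:3*k]))   # output gate
    g = np.tanh(z[3*k:])                # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm_layer(X, k, rng):
    """Run one Bi-LSTM layer over a sequence X: (T, d_in) -> (T, 2k)."""
    T, d_in = X.shape
    params = [(rng.standard_normal((4*k, d_in)) * 0.1,
               rng.standard_normal((4*k, k)) * 0.1,
               np.zeros(4*k)) for _ in range(2)]  # forward and backward
    outs = []
    for (W, U, b), seq in zip(params, (X, X[::-1])):
        h, c = np.zeros(k), np.zeros(k)
        hs = []
        for x in seq:
            h, c = lstm_step(x, h, c, W, U, b)
            hs.append(h)
        outs.append(np.array(hs))
    # re-align the backward pass with forward time order, then concatenate
    return np.concatenate([outs[0], outs[1][::-1]], axis=1)

def n_layered_bilstm(X, k, n, seed=0):
    """Stack n Bi-LSTM layers; each layer consumes the previous layer's output."""
    rng = np.random.default_rng(seed)
    Y = X
    for _ in range(n):
        Y = bilstm_layer(Y, k, rng)
    return Y
```

A 2-layered call such as `n_layered_bilstm(X, k=8, n=2)` maps a (T, 16) token-embedding sequence to a (T, 16) representation that a classification head can consume.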
We evaluate the model performance using different hyperparameters (e.g., β, Υ, η, and ξ), as shown in Figs. 10, 11, 12, and 13. The experiments provide various additional insights. Figs. 10, 11, and 13 show that the hyperparameters β, Υ, and ξ have only a slight effect on the F1ρ score. Similar performance trends are observed in the calculated A and F1ω scores. By contrast, a significant impact on the F1ρ score is found when different learning rates (e.g., η = {0.001, 0.005, 0.01}) are considered for all models across the seven datasets, as shown in Fig. 12. The F1ρ scores improve significantly when η is set to 0.001 compared to η = 0.005 and η = 0.01, implying that lowering η improves the model performance. The layered Bi-LSTM models maintain higher average scores on all three evaluation metrics F1ρ, F1ω, and A across the seven datasets.
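The sensitivity to η has a simple intuition: too large a step size can overshoot the loss minimum, while a smaller one descends stably. A generic toy illustration on a quadratic loss (not the paper's training setup, where the exact dynamics are more complex):

```python
def gradient_descent(lr, steps=100, x0=5.0):
    """Minimize f(x) = x^2 with a fixed learning rate; returns the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # the gradient of x^2 is 2x
    return abs(x)

small = gradient_descent(0.001)  # stable: |x| shrinks toward the minimum at 0
large = gradient_descent(1.1)    # unstable: each step multiplies |x| by 1.2
```

With the small rate the iterate contracts toward 0; with the large rate it oscillates with growing amplitude, which is the pathology a conservative η avoids.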
However, it is worth mentioning that even though the LSTM with Attention model performed well, it still fell short of the layered Bi-LSTM models. The 3-layered Bi-LSTM model, in particular, outperformed all other SOTA models, achieving F1ρ, F1ω, and A scores of 93.42%, 96.55%, and 96.56%, respectively. These results further underscore the effectiveness of the proposed layered Bi-LSTM model compared to the SOTA models.

A. Performance Analysis
We propose herein an n-layered Bi-LSTM model for the program code classification task that takes into account the complex context and diversity of program codes. We evaluate the model performance by training, validating, and testing the model on real-world program codes. The quantitative classification results (Tables III-VI) show that the layered Bi-LSTM model achieves better classification results than the LSTM and Bi-LSTM models. Table VII illustrates that the 2-layered Bi-LSTM model achieves approximately 0.32% and 6.27% higher A than the Bi-LSTM and LSTM models, respectively. In addition to A, the 2-layered Bi-LSTM model also obtains higher κ and OvR scores. However, despite the good performance of these models, a notable difference is observed between the F1ρ and F1ω scores for most datasets (Figs. 6-9). Underlying this difference is the class imbalance in the test data: some classes contain many instances, while others have relatively few. The overall classification results (Fig. 17) of the proposed 2-layered Bi-LSTM model fully reflect its superiority over the other state-of-the-art models.
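The gap between F1ρ and F1ω under class imbalance can be reproduced on toy labels (an illustrative example, not the paper's data): macro-F1 averages all classes equally, so one poorly handled minority class drags it below the support-weighted score.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Per-class F1 scores and class supports, computed from scratch."""
    scores, support = {}, Counter(y_true)
    for c in sorted(set(y_true)):
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores, support

# Imbalanced toy labels: class 'A' has 8 instances, 'B' only 2,
# and one of the two minority instances is misclassified.
y_true = ['A'] * 8 + ['B'] * 2
y_pred = ['A'] * 8 + ['B', 'A']
scores, support = f1_per_class(y_true, y_pred)
macro = sum(scores.values()) / len(scores)                       # ~0.804
weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)  # ~0.886
```

A single minority error barely moves the weighted score but costs the macro score noticeably, mirroring the F1ρ/F1ω gap observed on the imbalanced test sets.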

B. Impact of Hyperparameters
The DNN model performance is highly dependent on the optimal hyperparameters. Choosing the optimal parameters is a non-trivial task because it requires heavy fine-tuning. We utilize different hyperparameter sets during model training and evaluation to achieve better results. Tables III-VII present the model performances based on different hyperparameter settings. Figs. 10-13 depict the classification results of the LSTM, Bi-LSTM, and 2-layered Bi-LSTM models for various values of the parameters β, Υ, η, and ξ. In particular, Fig. 12(c) shows the effect of changing η on the model performance. The 2-layered Bi-LSTM model performs approximately 4.27% better with η = 0.001 than with η = 0.005, and approximately 9.46% better with η = 0.001 than with η = 0.01. On the other hand, Figs. 10, 11, and 13 illustrate the limited effect of the other parameters on the model performance across all datasets. Figs. 14-16 show the comparative F1ρ, F1ω, and A scores using η values of 0.001, 0.005, and 0.01, respectively, with a specific hyperparameter set. The results demonstrate the importance of selecting optimal hyperparameters for the model performance.
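Searching over these settings amounts to enumerating the Cartesian product of the value sets. A sketch of that enumeration (the value sets mirror those reported in the experiments; the training call for each configuration is omitted):

```python
from itertools import product

# Hyperparameter grid mirroring the sets used in the experiments.
grid = {
    "batch_size": [16, 32, 64],                 # beta
    "activation": ["tanh", "sigmoid", "relu"],  # upsilon
    "lr":         [0.001, 0.005, 0.01],         # eta
    "dropout":    [None, 0.1, 0.3],             # xi
}

# One dict per configuration; each would be passed to a training run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 3 * 3 * 3 = 81 configurations
```

Even this modest grid yields 81 runs per model and dataset, which is why exhaustive fine-tuning is described as heavy.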

C. Scalability of the n-Layered Bi-LSTM Model
We observe the model performance by combining three datasets into the Sorting + Searching + G&T dataset to ensure higher complexity, data size, and diversity. Table VII shows that the 2-layered Bi-LSTM, 3-layered Bi-LSTM, Bi-LSTM, and LSTM models achieve A scores of 96.48%, 96.28%, 96.06%, and 95.77%, respectively, for the Sorting + Searching + G&T dataset with Υ = sigmoid. Similar trends are observed for the Sorting + Searching dataset, where the 2- and 3-layered Bi-LSTM models obtain higher A scores. This is because the layered Bi-LSTM model considers a higher number of trainable parameters (Table IX). The 2-layered Bi-LSTM model specifically considers an average of 0.961 million more trainable parameters than the Bi-LSTM model, while the 3-layered Bi-LSTM model considers an average of 1.92 million more. Beyond the higher number of trainable parameters, the layered structure and the propagation of representations from one layer to the next help the model to better understand the dependencies and correlations among the variables, functions, classes, tokens, and keywords of codes. The layered architecture can also be useful in other application domains whose tasks involve complex, diverse, and large datasets; in such cases, a layered Bi-LSTM model can be adopted by simply expanding the number of layers (e.g., n = 3, 4, 5, ...).
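The per-layer parameter growth follows directly from the standard Bi-LSTM parameter count. The sketch below assumes a hidden size of k = 200 (an assumption for illustration; the text does not state the hidden size here), under which the count of one extra stacked layer matches the reported figure of roughly 0.961 million:

```python
def bilstm_params(input_dim, k):
    """Trainable parameters of one Bi-LSTM layer with k hidden units:
    four gate blocks of (k*input_dim + k*k + k) weights each,
    doubled for the forward and backward directions."""
    return 2 * 4 * (k * input_dim + k * k + k)

k = 200  # assumed hidden size, chosen so the count matches the reported figure
# A stacked layer receives the previous layer's 2k-dimensional concatenated output.
extra = bilstm_params(2 * k, k)
print(extra)  # 961600 ~ the 0.961 million additional parameters per extra layer
```

Since each additional layer contributes the same amount, two extra layers add roughly 1.92 million parameters, consistent with the 3-layered figure quoted above.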
In addition, we conducted experiments on further real-world datasets to demonstrate the scalability of the proposed n-layered Bi-LSTM model. We utilized two real-world datasets sourced from Project_CodeNet [62] for language classification tasks. First, we performed experiments with the Project_CodeNet_LangClass (PCL) dataset [63], which contains program code written in ten (10) programming languages: 'Haskell', 'JavaScript', 'C#', 'C++', 'PHP', 'C', 'D', 'Rust', 'Java', and 'Python'. For these experiments, we set the hyperparameters as follows for all models: epoch = 50, Υ = ReLu, β = 64, ξ = none, and η = 0.001. The results of the language classification tasks are presented in Table X. Notably, the LSTM model did not yield favorable results compared to the Bi-LSTM and 2-layered Bi-LSTM models. The Bi-LSTM model achieved F1ρ, F1ω, and A scores of 96.18%, 95.14%, and 95.23%, respectively. In contrast, the 2-layered Bi-LSTM model achieved 100% accuracy across all evaluation metrics, outperforming the other models. It is worth noting that the data size of the Project_CodeNet_LangClass dataset is not particularly large. Furthermore, we conducted a similar experiment with another real-world dataset known as Mini_Project_CodeNet (MPC) [64], which comprises approximately 8,819 solution codes written in six (06) programming languages: 'C++', 'Java', 'Ruby', 'Go', 'Python', and 'C'. The number of solution codes in this dataset significantly exceeds that of the PCL dataset. In this experiment, the Bi-LSTM model achieved F1ρ, F1ω, and A scores of 97.02%, 96.75%, and 96.76%, respectively, as shown in Table X. Conversely, the 2-layered Bi-LSTM model obtained F1ρ, F1ω, and A scores of 97.47%, 97.24%, and 97.25%, respectively, surpassing the Bi-LSTM model. Intriguingly, the 3-layered Bi-LSTM model achieved even better results than the 2-layered Bi-LSTM. This outcome underscores that the layered model possesses a deeper understanding of the code, enabling more accurate classification based on the programming languages used.

D. Ablation Studies
Since the proposed n-layered Bi-LSTM model comprises various components, including layers and hyperparameters, all of which significantly impact the overall model performance, it is crucial to assess the individual contributions of these components. In light of this, we conducted a series of ablation tests to elucidate the effects of these components on the model's performance. These ablation tests are carried out using the 2-layered Bi-LSTM model on the Searching dataset. Initially, we investigated the influence of the dropout layer on model performance. When the dropout value is set to ξ = 0.1, the model achieves F1ρ, F1ω, and A scores of 95.55%, 93.64%, and 93.62%, respectively. For ξ = 0.3, the model attains F1ρ, F1ω, and A scores of 95.00%, 93.21%, and 93.18%, respectively. However, when the dropout layer is removed entirely, the model yields even better results, with F1ρ, F1ω, and A scores of 95.76%, 93.97%, and 93.95%, respectively. These results indicate that the dropout layer has a relatively minor impact on the proposed layered model, leading us to conduct subsequent experiments without it.
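The ξ settings correspond to standard (inverted) dropout, which zeroes a random fraction of activations during training and rescales the rest so the expected activation is unchanged. A generic NumPy sketch (not the exact layer implementation used in the experiments):

```python
import numpy as np

def dropout(x, xi, rng, train=True):
    """Inverted dropout: zero a fraction xi of units and rescale the
    survivors by 1/(1 - xi). xi = None or 0 disables it (the 'none' setting)."""
    if not train or not xi:
        return x
    mask = (rng.random(x.shape) >= xi) / (1.0 - xi)
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((4, 8))
assert np.array_equal(dropout(x, None, rng), x)  # xi = none: identity
y = dropout(x, 0.3, rng)                         # ~30% of entries zeroed
```

At inference time (`train=False`) the layer is always an identity, so the ablation only changes the training-time regularization.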
Furthermore, we conducted experiments both with and without the dense layer, as summarized in Table XI. With the dense layer, the 2-layered Bi-LSTM model achieves A scores of 93.95%, 93.57%, and 93.90% for the activation functions ReLu, Sigmoid, and Tanh, respectively. Conversely, when the dense layer is removed, the model obtains an A score of 93.24%. The model's performance is evidently degraded in the absence of the dense layer, indicating its significant contribution to the overall model performance.
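The with/without-dense ablation can be expressed as a toggle in the classification head. A minimal NumPy sketch under assumed layer sizes (the actual head dimensions are not stated here, and `head` is a hypothetical helper): with `use_dense=False`, the features feed the softmax output layer directly.

```python
import numpy as np

ACTS = {"relu": lambda z: np.maximum(z, 0.0),
        "sigmoid": lambda z: 1 / (1 + np.exp(-z)),
        "tanh": np.tanh}

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def head(features, n_classes, rng, use_dense=True, act="relu"):
    """Classification head over Bi-LSTM features; use_dense=False
    reproduces the ablated variant without the intermediate dense layer."""
    d = features.shape[-1]
    if use_dense:
        W1 = rng.standard_normal((d, d)) * 0.1
        features = ACTS[act](features @ W1)
    W2 = rng.standard_normal((features.shape[-1], n_classes)) * 0.1
    return softmax(features @ W2)
```

The dense layer adds one learned nonlinear transformation between the sequence encoder and the output, which is what the ablation credits for the accuracy difference.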

E. Suitability for Programming Learning and Software Engineering
The proposed layered Bi-LSTM model yields significant results in classifying real-world program codes based on their algorithm or problem names. Searching for and recognizing program codes in large code repositories are challenging tasks for programmers, especially students. The proposed classification model can therefore help programmers and students search for and recognize the desired program codes in large repositories. It can ease a programmer's workflow and improve their technical skills. The experimental results show that the proposed classification model classifies complex and diverse codes with a high degree of accuracy. Furthermore, the model can be integrated into existing programming learning platforms (e.g., OJ systems). Meanwhile, in SE, the reuse of software modules is one of the most important processes for realizing faster development; searching for and recognizing software modules are thus key tasks to which the proposed model can be applied. The model can also be extended to various SE tasks, such as defect detection, code refactoring, and code review. In addition, the proposed model (classification of algorithms/codes) can serve as a foundation for many other machine learning models for coding tasks.
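As a sketch of the repository-search use case, the classifier's predicted labels can index a code archive so that codes are retrieved by algorithm or problem name. The `build_index` helper and the file names below are hypothetical:

```python
from collections import defaultdict

def build_index(predictions):
    """Group code files by predicted class so a repository can be searched
    by algorithm/problem name. `predictions` maps file path -> label
    (the hypothetical output of the classification model)."""
    index = defaultdict(list)
    for path, label in predictions.items():
        index[label].append(path)
    return index

preds = {"a.py": "sorting", "b.cpp": "searching", "c.java": "sorting"}
index = build_index(preds)
print(index["sorting"])  # ['a.py', 'c.java']
```

In an OJ system, such an index would let a student pull up all archived solutions predicted to implement, say, sorting, without relying on file names or manual tags.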

VII. CONCLUSION
In this paper, a layered Bi-LSTM model for program code classification is proposed. The architecture and theory of the layered Bi-LSTM model are described. The deep architecture of the layered Bi-LSTM model can classify complex, large, and diverse program codes with a high degree of accuracy. The experimental results on seven real-world datasets, namely Sorting, Searching, NC, G&T, Sorting + Searching, NC+BDS, and Sorting + Searching + G&T, show that the 2-layered Bi-LSTM model outperforms baseline models such as the LSTM and the Bi-LSTM. Furthermore, the dataset diversity and complexity were increased by combining multiple datasets (i.e., Sorting + Searching, NC+BDS, and Sorting + Searching + G&T) to verify the model performance; the 2-layered Bi-LSTM model again achieves better classification results than the other models. Various hyperparameters (e.g., β, Υ, ξ, and η) were fine-tuned to achieve better results with the models. We also investigated the suitability of the proposed model in the domains of programming learning and software engineering.
In the future, the proposed model can be extended into a language model for generating correct codes from erroneous ones. In this case, code pairs <erroneous, correct> can be used for model training, validation, and evaluation, and the model can then generate the corresponding correct code for a given erroneous one.

Index Terms—Multilingual program code, code classification, bidirectional LSTM (Bi-LSTM), layered Bi-LSTM, programming learning.

Fig. 2. Motivational example: A selection sort problem is solved with three different programming languages (Java, C++, and Python).

Fig. 3. Overview of the proposed code classification model.

Algorithm 2: n-Layered Bidirectional LSTM (Bi-LSTM)
1: Input: Embedding vectors E ∈ R^(v×d) from tokenized code sequences i ∈ I according to (19).
2: Output: y_t^(n): high-dimensional vector representation of the code sequences, where t is the time step and n is the number of Bi-LSTM layers.
3: for n = 1 : N do
4:   Calculate the forward hidden state h_t^(n_f) of the n-layered Bi-LSTM according to (38).
5:   Calculate the backward hidden state h_t^(n_b) of the n-layered Bi-LSTM according to (39).
6:   Concatenate both hidden states into y_t^(n).
7: end for

Figs. 6-9 show the average F1ρ and F1ω scores for each dataset. The 2-layered Bi-LSTM model outperforms the other models, achieving an average F1ρ score of 96.10 ± 0.60 and an F1ω score of 97.10 ± 0.25. The Bi-LSTM model achieves an average F1ρ score of 95.00 ± 0.45 and an F1ω score of 96.20 ± 0.45. The LSTM model yields an average F1ρ score of 87.00 ± 0.70 and an F1ω score of 89.00 ± 0.10, because the long-term dependencies, diversity, and complex context of the solution codes greatly affect the LSTM performance.

The computation time (in seconds) for model training on each dataset shows that the 3-layered Bi-LSTM model takes more time than the other models; in other words, increasing the number of layers requires more training time.

TABLE III. Quantitative classification results (%) of the macro-Pρ and Rρ and the weighted-Pω and Rω using the LSTM.

TABLE IV. Quantitative classification results (%) of the macro-Pρ and Rρ and the weighted-Pω and Rω using the Bi-LSTM.

TABLE V. Quantitative classification results (%) of the macro-Pρ and Rρ and the weighted-Pω and Rω using the 2-layered Bi-LSTM.

TABLE VI. Quantitative classification results (%) of the

TABLE VIII. Quantitative results (%) for classification with the SOTA models.