Machine Learning Based Embedded Code Multi-Label Classification

With the development of Internet of Things (IoT) technology, embedded electronic devices have penetrated every corner of our daily lives. As the brain of IoT devices, the embedded microcontroller unit (MCU) plays an irreplaceable role. MCU functions are becoming increasingly powerful and complicated, which brings huge challenges to embedded programmers. Embedded code, which is highly tied to hardware resources, differs from other popular programming code. Hardware configuration can be a major challenge for programmers who are skilled mainly in software development and algorithm design, and online code searching is time consuming and cannot guarantee an optimal result. To solve this problem, this paper presents an embedded code classifier designed to help embedded programmers search for the most efficient code with precise tags. A high-quality embedded code dataset is built, and a tag-correlated multi-label machine learning model is developed for it. The experimental results show that the proposed code dataset structure is more efficient for embedded code classification, and that the proposed classifier algorithm achieves promising results on the embedded code dataset, outperforming traditional machine learning text classification models.


I. INTRODUCTION
The Internet of Things (IoT) has experienced explosive growth and has become a technical revolution following the Internet. Embedded electronic devices, as key components of IoT, have penetrated every corner of our daily lives. In the IoT field, the embedded microcontroller unit (MCU) plays an irreplaceable role because of its advantages of real-time processing and low power consumption. The world's leading MCU manufacturers, such as ST, TI, and NXP, are upgrading their microcontrollers at an unprecedented speed. MCU functions are becoming more powerful and complicated as well, which undoubtedly brings huge challenges to embedded programmers.
Embedded code, mostly written in assembly and C, differs from other popular programming code (e.g., Java, Python): it is highly tied to hardware resources. Specifically, embedded code may include function module selection, register configuration, memory allocation, etc. The cumbersome hardware configuration is an extremely big challenge for programmers, as most of them are good at software development and algorithm design but may not be good at hardware configuration. In most cases, embedded programmers refer to online resources such as relevant forums and official websites to search for sample code. However, many online embedded code resources are not well organized. In other words, online embedded code searching is not an optimal approach, because various hardware selections may lead to different functions even though the software and algorithms are similar, and this is usually very time consuming for embedded programmers to figure out. A high-quality embedded code segment with precise tags is what an embedded programmer desires the most [1].
To solve the above-mentioned problem, a tag-correlated machine learning based classifier is proposed to assist embedded programmers in searching for the best-matched code segment through more specific tags. First of all, a high-quality embedded code dataset was built. The code segments in the dataset were mainly collected from official websites and Stack Overflow and had been verified to be efficient. Next, a tag-correlated multi-label code classifier algorithm was developed to label each code segment with precise tags (including the chip model, the function module, and the register name). Embedded programmers can thus query the most desired code according to the precise tag information. As mentioned above, there are two crucial aspects in this design: a high-quality dataset and an efficient multi-label algorithm. The dataset design integrates a unique structure for this specific application, while the multi-label algorithm adopts the mature machine learning based multi-label text classification methodology used in Natural Language Processing (NLP).
The rest of this paper is organized as follows. The embedded code structure is analyzed and typical multi-label text classification algorithms are addressed and compared in Sec. II. The proposed dataset structure and refined multi-label algorithm are illustrated in Sec. III. Experimental results and data analysis are demonstrated in Sec. IV. Sec. V includes the concluding remarks and future work.

II. EMBEDDED CODE STRUCTURE AND TEXT CLASSIFICATION
Before introducing the proposed design, we first present the structure of embedded code and its inherent features. Moreover, the vectorization methods (CountVec, TF-IDFVec) and the typical algorithms for multi-class and multi-label text classification, such as Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), One-to-Rest, and Classifier Chains, are explained. The evaluation metrics are addressed, and the advantages and limitations of each algorithm are also considered. Finally, the popular deep learning model TextCNN is briefly introduced.

A. EMBEDDED CODE COMPOSITION
The input data of our design is embedded code. Each collected code segment has two parts: the code description and the code content. The code description is composed of the code title and code comments; it is written in natural language to point out which application the embedded code is used for, and it also makes the code easier for other programmers to understand. The processing methodology for the code description is equivalent to a text classifier used in NLP, whereas the embedded code content has its own unique format and features. The typical structure of embedded code content is composed of hardware declaration and configuration, self-defined variables, and the code algorithms, as shown in Fig. 1. The self-defined variables and algorithms are defined arbitrarily and therefore cannot be utilized as features of the code. The consistent and unique features within one embedded code segment are its hardware-related configuration and some fixed functions, e.g., assigned register values and interrupt service functions. In other words, the valuable information hidden in an embedded code segment is its hardware information and its specific functions, which are the unique features to be extracted.
As mentioned in Sec. I, embedded code is hardware related. A typical feature of embedded code is that it contains the hardware information of the MCU, such as the registers of the I/O ports, Timer, AD, etc. In return, the hardware information reflects which function module and which register are used, making it possible to label a code segment with more precise tags.

B. TEXT CLASSIFICATION
Text classification is a classic supervised learning task that assigns tags or categories to a text document according to the inherent features of its content. It is one of the fundamental tasks in the NLP field, with broad applications such as news labeling, spam detection, and sentiment analysis [2]. As shown in Fig. 2, the input is a text or document, and the output is a predicted class c_i ∈ C, where C is a fixed set of classes C = {c_1, c_2, ..., c_n}. Step 2 of Fig. 2, called text preprocessing or text cleaning, is applied before the classification model to avoid ''garbage in, garbage out''. Text cleaning techniques work extremely well in practice. For our particular dataset, the text cleaning step includes C language decoding; removal of stop words, punctuation, and bad characters; lower-case conversion; etc. In step 3, the text in the document is vectorized. Popular approaches such as CountVec, TF-IDFVec [3], and Word2Vec are the prevalent models in text vectorization. In this design, both CountVec and TF-IDFVec are tested. Equation (1) gives the mathematical representation of TF-IDF:

TF-IDF(x) = TF(x) × IDF(x),    (1)

where TF is the term frequency and IDF is the inverse document frequency:

IDF(x) = log(N / N(x)),    (2)

where N is the total number of texts in the corpus and N(x) is the number of texts in the corpus containing the word x. To handle the case N(x) = 0, IDF is smoothed as:

IDF(x) = log((N + 1) / (N(x) + 1)) + 1.    (3)

With the fast development of data science, machine learning has proved to be efficient on large-scale text classification tasks. The working principle of machine learning based text classification is diagrammed in Fig. 3. Typical traditional text classification algorithms, such as Naive Bayes (NB) [5], Logistic Regression (LR) [6], and Support Vector Machine (SVM) [7], achieve good results on relatively small-scale text.
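The TF-IDF computation above can be sketched in plain Python. This is only a minimal illustration of the formulas, not the scikit-learn implementation used in the experiments, and the toy corpus is hypothetical:

```python
import math

# Hypothetical toy corpus of embedded-code descriptions.
corpus = [
    "configure timer interrupt register",
    "configure adc register portd",
    "pwm duty cycle timer",
]

N = len(corpus)
docs = [doc.split() for doc in corpus]

def tf(word, doc):
    # Term frequency: relative frequency of the word in one document.
    return doc.count(word) / len(doc)

def idf(word):
    # Smoothed inverse document frequency: log((N + 1) / (N(x) + 1)) + 1
    # avoids division by zero when the word occurs in no document.
    n_x = sum(1 for doc in docs if word in doc)
    return math.log((N + 1) / (n_x + 1)) + 1

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# "timer" appears in 2 of 3 documents, "portd" in only 1,
# so "portd" receives the higher IDF weight.
print(round(idf("timer"), 3), round(idf("portd"), 3))
```

A rarely occurring token such as a register name thus gets a larger weight than a common token, which is exactly the behavior TF-IDFVec exploits.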
With the considerable progress in computing hardware, deep learning models such as Convolutional Neural Networks (CNN) [8] have proved effective on large-scale text classification. From traditional approaches to deep learning models, various text classification algorithms have been verified in many application scenarios, such as news classification, spam email filtering, and multi-labeling [9]-[11].
Depending on the application scenario, multi-class classification (MCC) and multi-label classification (MLC) are the two dominant categories in text classification. In MCC, the classes are mutually exclusive, and the predicted document is labeled with only one tag out of a set of classes C = {c_1, c_2, ..., c_n}. In MLC, the predicted document is assigned one or more tags, and the tags may be related.
In this research, a code segment has one or more functions, which means each code segment tends to be labeled with one or more classes; therefore, MLC is chosen. Moreover, some MLC algorithms are derived from MCC, such as the One-to-Rest and Binary Relevance algorithms.

1) NB TEXT CLASSIFICATION
The NB text classification algorithm originates from Bayes' theorem, which relates the conditional probabilities of two events to the probabilities of each individual event [12], [13]. Any vector that represents a text contains the probabilities of each word's appearance within texts of a given category, so the algorithm can compute the likelihood of that text belonging to the category [14]. Naive Bayes comprises a family of statistical algorithms that one can use for text classification. One member of the NB family is Multinomial Naive Bayes (MNB) [3]. Its main advantages are that good results can be achieved even when the dataset is small (e.g., a couple of thousand tagged samples) and that its computational complexity is low. The mathematical foundation is illustrated in the following equations.
For a document d and a class c, the classifier estimates the conditional probability

P(c|d),    (4)

and the best class in NB classification is the most likely, or maximum a posteriori (MAP), class C_MAP:

C_MAP = argmax_{c ∈ C} P(c|d).    (5)

According to Bayes' rule,

C_MAP = argmax_{c ∈ C} P(d|c)P(c) / P(d).    (6)

Since P(d) is the same for every class, the denominator can be dropped:

C_MAP = argmax_{c ∈ C} P(d|c)P(c).    (7)

Applying the rule to the full document, represented by its features x_1, x_2, ..., x_n, with the naive conditional-independence assumption:

C_MAP = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(x_i|c).    (8)
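The MAP decision rule described above can be sketched with scikit-learn's MultinomialNB. The toy texts and labels below are hypothetical, not drawn from the paper's dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical code descriptions for two classes.
texts = [
    "timer interrupt blink led",
    "led toggle on timer overflow",
    "read adc sensor channel",
    "sensor value from adc conversion",
]
labels = ["LED", "LED", "Sensor", "Sensor"]

vec = CountVectorizer()          # CountVec vectorization (step 3)
X = vec.fit_transform(texts)

clf = MultinomialNB()            # estimates P(c) and P(x_i | c) from counts
clf.fit(X, labels)

# Prediction picks argmax_c P(c) * prod_i P(x_i | c), i.e. the MAP class.
test = vec.transform(["adc sensor reading"])
print(clf.predict(test)[0])
```

Since "adc" and "sensor" occur only in the Sensor training texts, the MAP class for the query is Sensor.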

2) SVM TEXT CLASSIFICATION
SVM is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not [3], [7], [15]. It can be applied to any kind of data that can be converted to vector form; this means that in order to leverage the power of SVM text classification, texts have to be transformed into vectors. SVM is a discriminative classifier formally defined by a separating hyperplane: given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that sets the boundary between categories. In a two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class lying on one side. SVM does not need much training data to start providing accurate results. Although it needs more computational resources than Naive Bayes, SVM can achieve more accurate results [16].

3) LR TEXT CLASSIFICATION
LR text classification is another scheme that applies statistics to machine learning based classification problems. It makes a binary predictive analysis using a logistic function. In LR-based [6] text classification, a binary variable is modeled by the logistic function. Binary logistic regression and multinomial logistic regression are the popular variants used in text classification, and both are provided in the scikit-learn library [3]. On large-scale text, LR analysis of high-dimensional data poses computational and statistical challenges [17].

4) DEEP LEARNING BASED TEXT CLASSIFICATION
Convolutional neural networks (CNN) have been widely applied in image processing and achieve excellent results. Originally, CNN could not be applied to text classification, because the length of text is not fixed; CNN was only suitable for data with a constant size, such as images, with a convolution kernel typically fixed at 3×3. In 2014, Yoon Kim [8] applied CNN to text classification with a novel approach that brings sentences to the same length by padding short sentences and truncating longer ones. In [18], the authors proposed character-level convolutional networks for text classification, extending the classification task from the word level to the character level. Another deep learning based NLP model is BERT [19], a language representation model introduced in 2018 that provides state-of-the-art results on natural language processing tasks.

C. MULTI-LABEL TEXT CLASSIFICATION
Multi-label learning originated from the investigation of the text categorization problem, where each document may belong to several predefined topics simultaneously. In multi-label learning, the training set is composed of instances, each associated with a set of labels, and the task is to predict the label sets of unseen instances by analyzing training instances with known label sets [20]. Most traditional learning algorithms are developed for single-label classification problems. Therefore, many approaches in the literature transform the multi-label problem into multiple single-label problems so that existing single-label algorithms, such as One-to-Rest and Binary Relevance, can be applied. The drawback of this scheme, however, is that it cannot capture the underlying correlations between the classes.

1) ONE-TO-REST
Traditional two-class and multi-class problems can both be cast as multi-label ones by restricting each instance to have only one label [14]. Obviously, multi-label problems are more challenging. An intuitive approach to solving a multi-label problem is to decompose it into multiple independent binary classification problems (one per category) [20].
In the One-to-Rest algorithm, multiple independent binary classifiers are built; for an unseen instance, the classes with the highest confidence are selected. NB, SVM, and LR can all be embedded into One-to-Rest as the base classifier.
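The scheme can be sketched with scikit-learn's OneVsRestClassifier, here with SVM as the embedded base classifier. The toy texts and tag sets are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical code descriptions, each with one or more tags.
texts = [
    "led blink using timer interrupt",
    "pwm duty cycle drives motor speed",
    "timer overflow toggles led on portb",
    "motor control with pwm output",
]
tags = [["LED", "Timer"], ["Motor", "PWM"], ["LED", "Timer"], ["Motor", "PWM"]]

mlb = MultiLabelBinarizer()              # tag sets -> binary indicator matrix
Y = mlb.fit_transform(tags)

vec = CountVectorizer()
X = vec.fit_transform(texts)

# One independent binary SVM per tag; an instance may receive several tags.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)

pred = clf.predict(vec.transform(["pwm signal for motor"]))
print(sorted(mlb.inverse_transform(pred)[0]))
```

Note that each per-tag classifier is trained in isolation, so any correlation between, say, Motor and PWM is ignored; that is the limitation the chain-based methods below address.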

2) BINARY RELEVANCE
The basic idea of this algorithm is to decompose the multi-label learning problem into q independent binary classification problems, where each binary classification problem corresponds to a possible label in the label space [15], [21]. In this case, an ensemble of single-label binary classifiers is trained, one for each class. Each classifier predicts either the membership or the non-membership of one class. The union of all classes that were predicted is taken as the multi-label output. This approach is popular because it is easy to implement, however it also ignores the possible correlations between class labels [14].

3) CLASSIFIER CHAINS
In [22], J. Read et al. proposed Classifier Chains as a multi-label solution and proved its advantage in exploring tag correlations, which overcomes the disadvantage of Binary Relevance. A chain of binary classifiers C_0, C_1, ..., C_n is constructed in series, where classifier C_i uses the predictions of all classifiers C_j with j < i. The number of classifiers built for this approach is equal to the number of categories [14].
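A minimal sketch of the chain, using scikit-learn's ClassifierChain with logistic regression as the base classifier; the toy data are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical code descriptions and their tag sets.
texts = [
    "led blink using timer interrupt",
    "pwm duty cycle drives motor speed",
    "timer overflow toggles led",
    "motor control with pwm output",
]
tags = [["LED", "Timer"], ["Motor", "PWM"], ["LED", "Timer"], ["Motor", "PWM"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

vec = CountVectorizer()
X = vec.fit_transform(texts)

# Each classifier C_i in the chain additionally receives the predictions of
# all earlier classifiers C_j (j < i), so tag correlations can be exploited.
chain = ClassifierChain(LogisticRegression(), random_state=0)
chain.fit(X, Y)

pred = chain.predict(vec.transform(["motor speed with pwm"])).astype(int)
print(sorted(mlb.inverse_transform(pred)[0]))
```

In contrast to Binary Relevance, a confident PWM prediction early in the chain becomes an input feature for the later Motor classifier.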

4) ML-KNN
In [20], the authors proposed a lazy learning approach to multi-label learning, which is derived from the traditional K-nearest neighbor (KNN) algorithm. Specifically, for each unseen instance, its K nearest neighbors in the training set are identified first. After that, based on statistical information gained from the label sets of these neighboring instances, i.e. the number of neighboring instances belonging to each possible class, MAP principle is utilized to determine the label set for the unseen instance [23].
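The neighbor-voting idea can be sketched as follows. This is a simplified majority-vote version for illustration; the full ML-KNN of [20] applies the MAP principle to the neighbor label counts, and all data below are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature vectors and their binary label matrix
# (rows = instances, columns = tags).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]])

K = 2
nn = NearestNeighbors(n_neighbors=K).fit(X)

def predict(x):
    # 1) identify the K nearest neighbors of the unseen instance;
    # 2) count, per label, how many neighbors carry it;
    # 3) simplified decision: assign a label if a majority of neighbors
    #    has it (ML-KNN proper replaces this step with a MAP estimate).
    _, idx = nn.kneighbors([x])
    votes = Y[idx[0]].sum(axis=0)
    return (votes > K / 2).astype(int)

# The query is closest to the first two instances, so it inherits their labels.
print(predict([0.95, 0.05]).tolist())
```

The statistics gathered in step 2 are exactly the "number of neighboring instances belonging to each possible class" that ML-KNN feeds into its posterior estimation.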

5) LABEL POWERSET
Label Powerset is a problem transformation approach to multi-label classification that transforms a multi-label problem into a multi-class problem, with one multi-class classifier trained on all unique label combinations found in the training data [14]. The method maps each combination to a unique combination ID and performs multi-class classification using the combination IDs as classes. This approach does take possible correlations between class labels into account; it is called the label-powerset method because it considers each member of the power set of the labels in the training set as a single class. In the worst case the method must distinguish 2^|C| classes, so it has a high computational complexity: as the number of classes increases, the number of distinct label combinations can grow exponentially, which easily leads to combinatorial explosion and computational infeasibility. Furthermore, some label combinations will have very few positive examples.
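The transformation can be written out directly: map each unique tag combination to a combination ID, then train an ordinary multi-class classifier on those IDs. The toy data are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical code descriptions and their tag combinations.
texts = [
    "led blink timer interrupt",
    "led toggle timer overflow",
    "motor speed pwm duty",
    "motor drive pwm output",
]
tags = [("LED", "Timer"), ("LED", "Timer"), ("Motor", "PWM"), ("Motor", "PWM")]

# Map each unique label combination to a combination ID (the "powerset" class).
combos = sorted(set(tags))
combo_id = {c: i for i, c in enumerate(combos)}
y = [combo_id[t] for t in tags]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# One ordinary multi-class classifier over the combination IDs.
clf = LogisticRegression()
clf.fit(X, y)

pred = clf.predict(vec.transform(["pwm motor control"]))[0]
print(combos[pred])
```

Here only 2 of the 2^4 possible combinations actually occur, which keeps the problem small; with many tags the number of observed combinations, and hence classes, can explode.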

D. EVALUATION METRICS FOR MULTI-LABEL TEXT CLASSIFICATION
The classification results are measured by the evaluation metrics introduced in [14]. The predicted tags are compared against ground-truth tags that were manually acquired from experienced embedded programmers. After TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives) are obtained, the accuracy, precision, recall, and F_1-Score are calculated as evaluation factors. The evaluation measures for single-label classification are usually different from those for multi-label classification. Single-label classification uses simple metrics such as precision, recall, and accuracy. Specifically, in single-label classification, accuracy is the fraction of samples whose prediction ŷ^(i) equals the ground truth y^(i). In multi-label classification, the prediction of each label plays a role in the final evaluation; to measure a multi-label classifier, the average over all classes is taken into consideration. Two methods, micro-averaging and macro-averaging, were introduced in [14]. In micro-averaging, the TPs, TNs, FPs, and FNs of all classes are pooled, as illustrated in (9) and (10), and the micro-average F_1-Score is obtained as the harmonic mean of the micro precision and recall. Macro-averaging is straightforward: as shown in (11) and (12), the precision and recall are averaged over the classes. Macro-averaging measures how the system performs overall across the different classes, while micro-averaging is more useful when the classes vary in size.

Micro-averaging precision:

Precision_micro = Σ_i TP_i / (Σ_i TP_i + Σ_i FP_i),    (9)

Micro-averaging recall:

Recall_micro = Σ_i TP_i / (Σ_i TP_i + Σ_i FN_i).    (10)

Macro-averaging precision:

Precision_macro = (1/n) Σ_i TP_i / (TP_i + FP_i),    (11)

Macro-averaging recall:

Recall_macro = (1/n) Σ_i TP_i / (TP_i + FN_i).    (12)
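The difference between the two averaging schemes in (9)-(12) can be made concrete with per-class confusion counts. The counts below are hypothetical, chosen only to show that the two averages diverge when classes are imbalanced:

```python
import numpy as np

# Hypothetical per-class confusion counts for three classes.
TP = np.array([8, 2, 5])
FP = np.array([2, 1, 3])
FN = np.array([1, 4, 2])

# Micro-averaging: pool all counts first, as in Eqs. (9) and (10).
micro_p = TP.sum() / (TP.sum() + FP.sum())
micro_r = TP.sum() / (TP.sum() + FN.sum())
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

# Macro-averaging: average the per-class scores, as in Eqs. (11) and (12).
per_class_p = TP / (TP + FP)
per_class_r = TP / (TP + FN)
macro_p = per_class_p.mean()
macro_r = per_class_r.mean()

print(round(micro_r, 3), round(macro_r, 3))
```

The second class has poor recall (2 of 6) but few samples, so it drags the macro recall down much more than the micro recall, which is the imbalance effect noted above.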

III. PROPOSED EMBEDDED CODE DATASET STRUCTURE AND METHODOLOGY DESIGN
Before any machine learning classification research can be conducted, the dataset has to be thoroughly studied. Based on our experience in embedded system design, a very popular microcontroller family, PIC, is chosen to build our dataset; the proposed methodology can be extended to any microcontroller family. In this section, we present a novel dataset structure and a multi-label model for the code classifier.

A. THE DATASET BUILDING AND PREPROCESSING
The proposed dataset contains 1000 code segments distributed over 14 categories: 4 application-scenario categories (LED, LCD, Sensor, Motor) and 10 function-module categories (Timer, SPI, I2C, UART, PWM, AD/DA, PORTA, PORTB, PORTC, and PORTD). We perform the classification task with machine learning models for our embedded code classifier. As discussed in Sec. II, embedded code has two parts: the code description and the code content. The code description summarizes the general features of the code, such as the MCU type, the application scenario, and some function-module information, so rough code information can be extracted from it. Obviously, this general information is only a rough narration and is not sufficient to precisely label the code, because more hardware features, such as the I/O registers and the complete set of function modules, are hidden in the code content. The code content, in turn, usually does not reveal the application, because the variables are defined arbitrarily by the programmer. Based on this observation, we propose our own dataset structure: the code description and code content are reorganized and labeled together, so the tags for each code segment are more precise. Fig. 4 demonstrates the advantage of our dataset structure.
As clearly shown in Fig. 4, the tags PIC18, LED, and AD can be extracted from the code description. However, the more precise I/O information is not in the code description; it is hidden in the code content. Since PORTD is used as I/O, the tags PIC18F, AD, and PORTD are assigned according to the code content, but LED is missed there. By considering both parts, we reorganize the dataset and label each code segment from the code description and the code content simultaneously. The advantage of this operation is that tags are obtained from both the code description and the code content, and the relationship between application tags and function tags can be explored. The drawback of the dataset is the imbalance between the code description and the code content: the amount of text in the code content far exceeds that in the code description, and the relatively small amount of text in the code description leads to lower classification accuracy compared with the code content. In the proposed methodology below, label correlation is studied to solve this imbalance problem.

B. PROPOSED MODEL DESIGN
After obtaining a code database with tags marking each code segment through preprocessing, a unique parallel cascaded automatic code classifier is proposed, as shown in Fig. 5. Specifically, there are two machine learning channels in our code classifier: one channel processes the code description, and the other analyzes the code content. The function of the first channel is to extract the rough features, such as the chip name, the application scenario, and the rough function. The second channel is designed to extract the hardware features and generate more precise tags: the precise features, such as the chip model, function module, and register information, are extracted according to the inherent hardware information in the code segment.
After training, multi-label classifier models 1 and 2 are applied in predictive models 1 and 2, respectively, to generate the application tags and function-module tags. Afterwards, a union operation combines the tags generated by the two channels to label each code segment. For the multi-label classifier design, both the CountVec and TF-IDFVec vectorization methodologies are tested, along with 7 different algorithms, including the One-to-Rest series, ML-KNN, Classifier Chains, Binary Relevance, and Label Powerset, which are further discussed in Sec. IV.
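The union step is a plain set union of the two channels' tag predictions. Using the Fig. 4 example tags as hypothetical channel outputs:

```python
# Hypothetical predictions from the two channels for one code segment,
# following the Fig. 4 example.
tags_description = {"PIC18", "LED", "AD"}     # channel 1: rough tags
tags_content = {"PIC18F", "AD", "PORTD"}      # channel 2: precise tags

# The union operation combines both tag sets to label the segment,
# so LED (description-only) and PORTD (content-only) are both kept.
final_tags = tags_description | tags_content
print(sorted(final_tags))
```

Tags that only one channel can see (LED from the description, PORTD from the content) both survive in the final label set.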
Moreover, in embedded code there exists a correlation between the applications and the function modules, as shown in Fig. 6. For example, the LED application is Timer and PWM related, the motor application is PWM related, and the sensor application is AD/DA related. Based on this, we further explore the tag correlations between channels 1 and 2. The objective is to discover the relationship between the application scenarios and the function modules; in other words, this design tells us which function modules are most often used for a specific application, and which application can be inferred from the function modules. A compensation value is introduced between correlated features. The correlations between applications and functions overcome the drawbacks of the imbalance in the proposed dataset and enhance the application accuracy. In the proposed methodology, the correlations of the features are trained, and the correlation score reflects the specific relationship between an application scenario and a specific function module.
The parameters for the application and function compensation in each category are defined as follows, where K_a is the number of application scenarios and K_f is the number of function modules. The compensation value to an application from its correlated functions in each category is δ_a, the compensation of the function categories to the application category, with λ_a the correlation value in each application category. The compensation value to a function from its correlated applications in each category is δ_f, the compensation of the application categories to the function category, with λ_f the correlation value in each function category.
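The compensation step can be sketched as follows. Since the paper's exact formulas for δ_a and δ_f are not reproduced here, this is only one plausible additive reading, in which an application's score is boosted by the correlation-weighted scores of its correlated function modules; the correlation values, score values, and the additive form itself are all assumptions for illustration:

```python
# Hypothetical correlation values lambda between application and function
# categories (cf. Fig. 6: Motor correlates with PWM at about 0.22).
correlation = {("Motor", "PWM"): 0.22, ("LED", "Timer"): 0.15}

# Hypothetical raw per-category confidence scores from the two channels.
app_scores = {"Motor": 0.41, "LED": 0.30}
func_scores = {"PWM": 0.90, "Timer": 0.20}

def compensate_applications(app_scores, func_scores, correlation):
    # Assumed additive form: delta_a boosts each application score by the
    # correlation-weighted scores of its correlated function modules.
    out = {}
    for app, score in app_scores.items():
        delta_a = sum(lam * func_scores[f]
                      for (a, f), lam in correlation.items() if a == app)
        out[app] = score + delta_a
    return out

print(compensate_applications(app_scores, func_scores, correlation))
```

Under this reading, a strong PWM signal from the content channel lifts a weak Motor score from the description channel, which is exactly the imbalance-compensation effect the model targets.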
The compensation for all the application categories is computed from the code description dataset, and the compensation for all the function categories is computed from the code content dataset.

IV. EXPERIMENTS RESULTS ANALYSIS
The experiments are carried out step by step in Jupyter notebooks with the Scikit-Learn libraries [3]. The outcomes are presented in two parts: the performance of the proposed dataset structure, and the comparison between the proposed model and traditional machine learning approaches. The dataset is split into a training set (67%) and a validation set (33%). Figs. 7 and 8 show the characteristics of the dataset. As a vectorization approach for text classification, TF-IDFVec usually outperforms CountVec because the inverse document frequency is considered, as shown in (1) and (2). However, embedded code contains a large number of self-defined variables, which are defined arbitrarily and behave like rare words; this large number of rare words degrades the effectiveness of TF-IDFVec. Therefore, both CountVec and TF-IDFVec are tested to demonstrate the influence of vectorization on performance. As for the machine learning classifier models, MNB, SVM, and LR within One-to-Rest, together with Classifier Chains, Binary Relevance, ML-KNN, and Label Powerset, are tested respectively.

A. PERFORMANCE OF THE PROPOSED DATASET STRUCTURE
In this section, the overall accuracy, F_1-Score-macro, F_1-Score-micro, and F_1-Score-weighted of the proposed code dataset (PCD), the code description only dataset (CDD), and the code content only dataset (CCD) are compared under 14 different algorithm configurations. Tables 1 to 3, as well as Figs. 9 to 11, show the accuracy, F_1-Score-macro, F_1-Score-micro, and F_1-Score-weighted of the 14 configurations for PCD, CDD, and CCD, respectively. The highest scores are achieved on PCD (Table 1 and Fig. 9) by CountVec + Binary Relevance (with SVM as the core): 0.724, 0.845, 0.890, and 0.885, respectively. These results show that PCD provides higher performance than both CDD (Table 2 and Fig. 10) and CCD (Table 3 and Fig. 11). Under the same algorithm, PCD shows more promising results than CDD and CCD. CDD has the worst performance, which means that the inherent features in the code description alone are not sufficient.
The accuracy, F_1-Score-macro, F_1-Score-micro, and F_1-Score-weighted scores evaluate the overall performance of the system. In most cases, however, we also want to evaluate the performance of a specific category; therefore, the accuracy scores of each category under the One-to-Rest series algorithms are investigated with CountVec and TF-IDFVec, respectively. In Tables 4 and 5, the experimental results show that the PCD structure achieves higher performance in most categories except two application categories, Sensor and Motor. These results demonstrate that the reorganized dataset structure is more promising for embedded code classification. The reason for the relatively low results in those two application categories is that, in PCD, the number of words in the code description is much smaller than the number of words in the code content, which lowers the weight of the description features and hence the performance in the application categories. To overcome this problem, we propose the tag-correlated machine learning model described in Sec. III; the following experiments verify the refined model.

B. PERFORMANCE OF THE PROPOSED METHODOLOGY
Before demonstrating the performance of the proposed methodology, we first investigate the correlation between the application scenarios and the function modules to verify the relationship between them. As shown in Fig. 12, and as speculated before, some of the labels are highly correlated. PORTA, PORTB, PORTC, and PORTD have very high pairwise correlation scores, mainly because the I/O configuration shares the same registers. Motor control and PWM also have a high correlation score, 0.22, which means that motor-control code is mostly written with PWM. A similar relationship holds for LED and Timer, and for Sensor and I2C/AD. The correlation values between the tags are clearly illustrated in Fig. 12.
With the formulas discussed in Sec. III, the label results of each category, especially the application categories, are compensated with the correlation values. To evaluate the proposed tag-correlated machine learning model, we investigate the accuracy enhancement in each category of the proposed dataset using the One-to-Rest algorithms with NB, SVM, and LR, testing CountVec and TF-IDFVec in combination with the different algorithms separately. As shown in Table 6, under the same vectorization methodology and classifier (MNB, SVM, LR), the accuracy of each correlated category is enhanced, outperforming the traditional methodologies in both the application categories and the function categories. For the LED, Sensor, and Motor application categories, the highest accuracies are improved to 0.937, 0.991, and 1.0 from the best scores of the traditional machine learning algorithms (0.906, 0.945, and 0.964). The accuracy of LCD remains the same, because there are no correlated tags for it. The proposed tag-correlated models, namely Co-MNB, Co-SVM, and Co-LR, overcome the low-accuracy problems of MNB, SVM, and LR caused by the imbalance between code description and code content. Similar performance results for CDD and CCD are tabulated in Tables 7 and 8, respectively. In Table 9, the accuracy scores of the proposed model with 14 different classification algorithms under CountVec and TF-IDFVec are compared for the PCD, CDD, and CCD datasets. From all the above results, the proposed tag-correlated model combined with the proposed dataset is proved to work more efficiently, outperforming the traditional dataset structure and the traditional machine learning models.

V. CONCLUSION
In this paper, a machine learning based automatic embedded code classifier is proposed to help embedded programmers query high-quality code segments. A multi-label methodology is utilized to precisely label the code segments. First, a dataset of high-quality embedded code segments for PIC MCUs is built. The label classes cover 4 application scenarios (LED, LCD, Sensor, Motor) and 10 function modules (Timer, SPI, I2C, UART, PWM, AD/DA, PORTA, PORTB, PORTC, and PORTD). Moreover, a novel dataset structure is designed to overcome the insufficient features in the code description and the code content. Second, a parallel multi-label machine learning model is developed to process the code description and code content separately. A correlation filter is designed to bridge the two channels and capture the relationship between the application background of each code segment and the function modules used in each application. Experiments are conducted in two parts to evaluate the performance of the proposed dataset and of the proposed machine learning model, respectively. The experimental results show that the proposed dataset outperforms datasets with the code description only or the code content only. In addition, experiments with the proposed model are carried out on PCD, CDD, and CCD and compared with the traditional models; the accuracy scores show that the correlation filter can significantly improve the accuracy for most of the categories. In conclusion, the proposed dataset structure and the proposed machine learning model both show promising results compared with the traditional methodologies.