A Self-Relevant CNN-SVM Model for Problem Classification in K-12 Question-Driven Learning

With the development of science and technology, learning patterns also evolve. In question-driven learning, students clarify and validate what they learn by answering questions. The resulting large number of questions needs good management. Well-performed management avoids the situation in which learning materials covering the same knowledge set are assigned to different sections because of ambiguous expressions. In this work, we propose a hybrid classification model using CNN-SVM that focuses on K-12 learning materials. We combine the Word2Vec feature and the hidden layer feature of the CNN. Because a question may contain both text and images, we also introduce a multi-modal preprocessing approach. The experiment shows that our preprocessing method and the hybrid model can outperform two state-of-the-art methods.


I. INTRODUCTION
These days, formal education remains the core of learning while combining with new learning scenarios. There are multiple stages of education. The education before college is referred to as the K-12 system, including kindergarten, elementary school, middle school, and high school. In the K-12 system, learner-centered learning and self-regulated learning are effective learning types [1]. Students need to prepare for the exams and the large-scale competency tests that are used to assess learning. Thus, students often review and validate what they learn by answering questions. In our work, we focus on question-driven learning in the K-12 system.
In question-driven learning, a typical question contains only text, as shown in Fig. 1(a). However, text alone sometimes cannot clearly express abstract information, and adding images can make it much easier for students to understand. For example, Fig. 1(b) is a question about calculating an area. When classifying the question, the geometric graph in Fig. 1(b) may play an important role. In addition, because the learning material is a mathematical document, the text contains not only the narrative description of the content but also mathematical expressions, which are also vital in question classification. Take Fig. 1(c) as an example: this question is an algebraic problem. If we only examine the narrative description, there is no knowledge point to which this question can be assigned. After incorporating the mathematical expressions, however, the question can be recognized as a polynomial-related problem.
It is worth noting that the new learning materials come from various sources simultaneously; the volume of the learning materials is increasing as time goes on and their content advances with time because of current affairs. Therefore, having a good way to manage the learning materials becomes an issue that needs to be studied. Well-managed learning materials can help students understand and enhance the concept of knowledge. In most learning scenarios, keywords of knowledge are emphasized repeatedly. This phenomenon not only appears in the K-12 system, but also occurs in our lifelong learning journey. With this observation, managing learning materials by categorizing them into chapters and knowledge sets using the keywords of the knowledge can be an effective method.
There are several challenges in classifying learning materials. First, to achieve better classification performance, a good feature extractor and good preprocessing are necessary so that the features deliver the most category-specific information. Next, a powerful classifier is needed that can best separate the different categories in the feature space. Finally, as the learning material dataset keeps growing, the model should be updated to maintain good performance.
Therefore, we apply a hybrid Convolutional Neural Network Support Vector Machine (CNN-SVM) model [2] [3] for the classification of question-driven learning materials. The dense layer is replaced by an SVM to get better global optimization in the feature space, while the convolutional layers are retained for feature extraction. Also, to manage the text, mathematical expressions, and images in our dataset, a series of preprocessing methods is implemented. The proposed model can categorize mathematical learning materials in a question-driven learning scenario. Additionally, a retraining mechanism is designed so that the model is continually strengthened by its growing dataset. Thus, our model is self-relevant for learning material management.
FIGURE 1: (a) A typical question that only contains narrative description. (c) A question in which mathematical expressions are the key to the classification.

II. RELATED WORK
In the past few years, machine learning and deep learning techniques have been applied to various fields, such as natural language processing, computer vision, bioinformatics, and artificial intelligence. These applications have changed the world in many ways.
A typical way to manage learning materials is to use human labeling, but there are several drawbacks. First, owing to data growth, the volume of learning materials can be considerable and ever growing. Second, human labeling may produce biased labels because of personal subjectivity. To address these issues, we can use a machine for classification in place of human labeling. A machine can reduce the human effort and eliminate the subjective aspect.

A. QUESTION CLASSIFICATION
Text classification is the ordinary way to perform question classification. A recent work, the Label-Embedding Attentive Model (LEAM) [11], has demonstrated state-of-the-art performance in text classification. The fundamental difference between ordinary text classification and K-12 math question classification is that K-12 math learning materials are both mathematical [6] and multimedia documents. As mathematical documents, math questions regularly include mathematical expressions. Moreover, some of the questions contain pictures of geometric figures to help the student understand the problems. In contrast, ordinary articles to be classified typically contain more words and do not contain any mathematical expressions.

B. MULTI-MODAL DOCUMENT
Many methods have been proposed for handling or classifying documents that contain different types of data. Some of the methods utilize relevant text information to enhance image features [7]. In [8], the authors introduce an approach that combines textual and visual statistics into a single feature vector. The approach applies color histograms and a dominant orientation histogram to represent the images as vectors, then combines these vectors with a text vector generated from Latent Semantic Indexing. The authors show that integrating features from images can produce a better result than the text-only model. The work also suggests the idea of combining multi-modal data into a single vector. [9] presents a model using a Bayesian network for the classification of structured multimedia documents that contain text and images. The Bayesian network can model the relationship of each element in the document, thus classifying documents with both structural and content information. It utilizes TF-IDF for text and an RGB histogram for image scoring. The above approaches are not suited to our dataset because our images are mostly geometric figures, which consist only of a white background and black lines. The Hough transform is utilized in previous works that focus on extracting features from, or understanding the graphs of, mathematical questions [5]. As a result, we apply the concept from [5], using the Hough transform to extract features from the diagrams of geometry questions, and apply Word2Vec for word embedding instead of TF-IDF. This gives us the similarity and relationship between words instead of purely statistical features.

C. HYBRID CNN-SVM MODEL
There are several hybrid models that combine the Convolutional Neural Network with the Support Vector Machine. They are mostly used in computer vision tasks. [10] presents a model for invariant and perceptual texture mapping. The model contains a three-layer Convolutional Neural Network that connects to many Support Vector Machines. The result shows that it outperforms an individual neural network and an individual SVM. [2] proposed a hybrid system that uses the Convolutional Neural Network to train and extract the features, after which a Gaussian-kernel Support Vector Machine is trained using the features from the Convolutional Neural Network. The model is used for generic object categorization, which is also a computer vision task. [3] combines the Convolutional Neural Network and the Support Vector Machine for visual pattern classification. The method uses the Convolutional Neural Network to learn features. Then it uses a Support Vector Machine to provide an optimal solution for the learned feature space. The proposed architecture is similar to the one in [2]. They both utilize the Convolutional Neural Network to learn and extract the features from the dataset and apply the Support Vector Machine to find a hyperplane in the feature space.

TABLE 1: Notations.
$R_C$    The part of the document representation produced by the CNN
$R$      The document representation
$C$      The set of classes
$TP$     The number of documents that are true positives
$TN$     The number of documents that are true negatives
$N$      The number of documents

III. SELF-RELEVANT CNN-SVM MODEL
The proposed model is separated into three parts: preprocessing, document representation, and classifier. There are four processes in the preprocessing part. First, a geometric feature process converts geometric features into keywords. Second, a mathematical expression process uses patterns (such as regular expressions) to capture commonly used formulas and converts them into plain-text keywords. Third, noise is eliminated from the plain text, and the keywords from the above processes are integrated with it. Finally, Chinese word segmentation is applied as the tokenization process.
In the document representation part, Word2Vec embeddings are generated from the tokenized text documents, and then the CNN is used as a feature extractor. The hidden layer feature of the CNN and the word vectors are combined as the text representation of the documents.
In the end, SVM is used as the classifier. The process of our classification system is depicted in Fig. 2.
FIGURE 2: A typical question that only contains narrative description.

A. PART 1. PREPROCESSING
Our dataset, $D$, is composed of text ($D_{text}$) and images ($D_{image}$). First, the features in $D_{image}$ are captured and represented as text. Then these features are integrated into $D_{text}$. After that, $D$ has been converted to a single-modality dataset $D_{text}$ for classification.

1) Image preprocessing
The purpose of image preprocessing is to extract the geometric features from the images and convert them into keywords. The images in our learning materials mostly contain geometric features, for example, figures composed of lines and points. Because these images are mostly black and white, the RGB histogram method is not suitable for extracting their features.
In the image preprocessing part, the images are sent to the ConvNetJS convolutional neural network (CNN) model and the Hough transform. ConvNetJS is a JavaScript library for training deep learning models, which we use to classify the images. Mathematical signs and fractions also appear as images in the question set. These images are treated as noise and are filtered out to obtain a clear view of the question set.
We first sieve the roughly 800 images down to about 600 that carry classification characteristics and sort them into 20 classes; we then perform the image classification. The 200 removed images, such as symbol images, do not carry knowledge features. All of the retained images are resized to 32 × 32. The network architecture is the ConvNetJS CIFAR-10 demo and the learning rate is set to 0.00001 in our implementation. The network architecture is the same CNN as in our hybrid model. After the classification is complete, we add the output of the model to the sample question.
For the Hough transform, edge detection is first performed on each image in $D_{image}$ to get a clear contour. Then, the Hough transform is applied to the resulting image to extract the contour. The scikit-image library provides built-in functions that we utilize in our implementation. A set of rules $R_{image}$ is defined for the composition of the extracted geometric features $f_{image}$. The processing of the images is denoted as $H(d_m, r_n)$, where $m$ is the number of images and $n$ is the number of rules. Fig. 3 shows two examples of our image preprocessing methods. We can infer that the first image contains a coordinate system by detecting the x-axis, the y-axis, and an origin O, so it can be represented by the keyword "coordinate system". The second image contains a coordinate system and a straight line, so it can be represented by the keyword "linear equation".
The validation accuracy of the ConvNetJS CNN model is about 0.8, which means that some of the images may be misclassified. The keyword output by the model is not a crucial factor because, as we shall see later, the Hough transform in the preprocessing part has excellent image detection accuracy. Moreover, the output of the text preprocessing part will improve the image classification. Therefore, a classification error from the CNN will only introduce data noise rather than a detrimental consequence.
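The rule-based step can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the scikit-image calls with a small pure-Python Hough accumulator, takes already-detected edge pixels as input, skips the detection of the origin O, and uses two hypothetical rules (the names `hough_lines`, `orientations`, `image_keywords`, the vote threshold, and the angle tolerance are all assumptions made for illustration).

```python
import math
from collections import Counter

def hough_lines(pixels, min_votes):
    """Vote in (theta, rho) space and return the angles (degrees) that hold a peak."""
    votes = Counter()
    for x, y in pixels:
        for theta in range(180):
            t = math.radians(theta)
            rho = round(x * math.cos(t) + y * math.sin(t))
            votes[(theta, rho)] += 1
    return {theta for (theta, _rho), n in votes.items() if n >= min_votes}

def orientations(thetas, tol=10):
    """Map peak angles (angle of the line normal) to coarse line orientations."""
    found = set()
    for theta in thetas:
        if theta <= tol or theta >= 180 - tol:
            found.add("vertical")      # normal along x means a vertical line
        elif abs(theta - 90) <= tol:
            found.add("horizontal")    # normal along y means a horizontal line
        else:
            found.add("oblique")
    return found

def image_keywords(pixels, min_votes=15):
    """Hypothetical rules: two axes give a coordinate system; an extra oblique
    line on top of the axes gives a linear equation."""
    kinds = orientations(hough_lines(pixels, min_votes))
    if {"vertical", "horizontal"} <= kinds:
        return ["linear equation"] if "oblique" in kinds else ["coordinate system"]
    return []

# Toy 21 x 21 edge maps: two axes, and a diagonal line through the origin.
axes = {(10, y) for y in range(21)} | {(x, 10) for x in range(21)}
line = {(t, t) for t in range(21)}
print(image_keywords(axes))
print(image_keywords(axes | line))
```

The returned keywords would then be appended to the question text, which is how the image modality is folded into the text modality.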

2) Mathematical Expressions Preprocessing
As mentioned above, much of the main content in our learning materials consists of mathematical expressions. We observed that some category-specific formulas in learning materials have certain patterns, since they are bounded by a limited knowledge set. Therefore, several regular expression patterns are used to parse the mathematical formulas before the punctuation is removed. A separate set of rules $R_{RE}$ is defined for the mathematical formulas in $D_{text}$. The documents are checked to see if there is a mathematical formula that matches any of the rules $r \in R_{RE}$. If a part of a mathematical expression in $d_i \in D_{text}$ matches $r$, the matched expression is replaced by the keyword $f_{RE}$. The mathematical formula extraction process is represented as $M(d_m, r_n)$.
FIGURE 3: Application of two of our rules, which capture a coordinate axis and a linear function on a coordinate axis, in image preprocessing. After binarization and edge detection, we utilize the Hough transform to extract the lines in the math figure. The rules are then applied to detect meaningful features.
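A minimal sketch of this replacement step follows. The two patterns and their keywords are hypothetical stand-ins for the rule set, which is not listed in the text; in the actual system the keywords would be Chinese terms matching the corpus.

```python
import re

# Hypothetical rule set: each entry pairs a pattern r with its keyword f_RE.
RULES = [
    (re.compile(r"[a-z]\s*\^\s*2"), " quadratic_term "),
    (re.compile(r"\d*[a-z]\s*[+-]\s*\d+\s*=\s*\d+"), " linear_equation "),
]

def extract_math_keywords(doc, rules=RULES):
    """Replace every formula that matches a rule with that rule's keyword."""
    for pattern, keyword in rules:
        doc = pattern.sub(keyword, doc)
    return " ".join(doc.split())  # collapse the extra spaces left by substitution

print(extract_math_keywords("Solve 2x + 3 = 7"))
print(extract_math_keywords("Expand x^2 + 1"))
```

Running this step before punctuation removal matters, because symbols such as `=` and `+` are exactly what the patterns rely on.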

3) Text Noise Elimination Preprocessing
This preprocessing step removes the noisy information in our dataset. For example, in Chinese text, many punctuation marks do not provide additional information and therefore contribute nothing to our classification. Consequently, these punctuation marks are treated as noise and are removed from the dataset before the classification.
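As a small illustration, punctuation can be stripped with a single character class. The set of characters below is an assumption (common full-width Chinese and ASCII punctuation); the exact noise list used in the experiments is not given.

```python
import re

# Assumed noise set: common full-width Chinese and ASCII punctuation marks.
NOISE = re.compile(r"[，。、；：？！「」『』（）《》“”‘’,.;:?!()\[\]\"']")

def remove_noise(text):
    """Delete punctuation that carries no category-specific information."""
    return NOISE.sub("", text)

print(remove_noise("求面積，並說明理由。"))
```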

4) Tokenization
Tokenization is the process of cutting the corpus into a sequence of terms. The sequence of terms helps us learn the relationships among them. Jieba is the library applied in our tokenization process.
After text and image preprocessing, the two modalities are merged into a single-modality dataset $D_{text}$.
Jieba is used to split our text corpus into sequences of Chinese vocabulary. The tokenized $D_{text}$ is denoted as $T$ and consists of sequences of tokens $(t_1, t_2, \cdots, t_n)$.

B. PART 2. DOCUMENT REPRESENTATION
Before classification, features need to be extracted from the preprocessed data and the feature vectors need to be generated for the classifier. Word2Vec and a CNN feature extractor are applied in this part. Unlike traditional statistical methods of feature extraction, both of the applied models contain a neural network, which allows us to extract higher-level features, such as the latent relationship between the word vectors, by means of model training. Our document representation concatenates the features of both Word2Vec and the CNN. The following is a brief introduction to the techniques:

1) Word2Vec
Word2Vec is a model for producing word embeddings. It learns the relationships between words and projects the words into a vector space, where the similarity between words can be evaluated via Euclidean distance or cosine similarity. Gensim is the library utilized to generate Word2Vec. We apply continuous bag-of-words (CBOW) in our system because it is less time consuming. The input of the model is the tokenized data $T$ from the previous processes, in which the learning materials have been tokenized into sequences of Chinese vocabulary. The learning materials are used as the corpus and the dimension of the word vectors is set to 250. The learning material corpus consists of a set of vocabulary, where vocabulary $i$ is denoted as $v_i$. The dictionary of the learning material corpus is denoted as $Dict_{lm} = \{v_1, v_2, \cdots, v_c\}$, where $c$ is the number of all of the vocabulary in $Dict_{lm}$. The input $(t_1, t_2, \cdots, t_n)$ is the sequences from $T$. After the training process, we obtain $Dict_{lm}$ and the word vectors $W = \{V_1, V_2, \cdots, V_c\}$, where $V_i$ is the word embedding of vocabulary $v_i$. To represent all tokens in $d_i \in T$, a bag-of-words of $W$ is applied as part of the representation of $T$. The Word2Vec representation of $T$, $R_W$, is a concatenated vector of $V_i$ and zero padding $z$. Here, $rw_i$ is a $(c \times 20)$-dimensional document vector. Following the vocabulary order of $Dict_{lm}$, the entry $wv_i$ in $rw_i$ is $V_i$ if $v_i$ exists in $d_i$; otherwise, $wv_i$ is padded by $z$.

2) Convolutional Neural Network
A CNN is a multi-layer architecture that is trained on the data and learns higher-level features, such as local connectivity and special structures. The last layer categorizes the result. In this stage, we extract the categorical local information of the context. Our network architecture consists of an embedding layer, a couple of 2D convolutional layers, a max pooling layer, and a fully-connected layer. The architecture of our model is shown in Fig. 4. At the beginning, the tokenized data $T$ is fed into the embedding layer, where $W$ is set as the initial weights of the layer. For each input, the embedding layer produces the corresponding word vectors and passes them on to the next layer. The next stage consists of two 2D convolutional layers. In both layers, the kernel size is set to 3 × 3 and the number of filters is set to 100. These parameters are chosen according to prior experiments. The kernels extract local features from the input word vectors. After the convolutional layers, a max pooling layer with pooling size 2 × 2 is added for downsampling. The max pooling layer passes the maximum value of each window to the next layer. After that, the network is flattened and connected to a fully-connected layer with 250 hidden nodes. This layer provides part of the document representation for our final classification. The last layer is a softmax layer, which is used to train and validate the network in the training process. After the training process of the CNN, the weight from the last fully-connected layer is extracted to generate the feature vector $R_C$ as part of the input of the classifier.
The document representation $R$ that we input to the classifier is the concatenated vector of $R_W$ and $R_C$, where $R = \{r_1, r_2, \cdots, r_m\}$.
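The dictionary-ordered concatenation with zero padding can be sketched as follows. This is an illustration only: the three-word dictionary and the 2-dimensional vectors are toy values chosen for readability (they are not trained embeddings), and the function name `word2vec_representation` is hypothetical.

```python
def word2vec_representation(doc_tokens, dictionary, vectors, dim):
    """Concatenate, in dictionary order, V_i for words present in the document
    and a zero vector z for words that are absent."""
    zero = [0.0] * dim
    present = set(doc_tokens)
    rep = []
    for word in dictionary:
        rep.extend(vectors[word] if word in present else zero)
    return rep

# Toy dictionary and 2-dimensional embeddings (illustrative values only).
dictionary = ["三角形", "面積", "方程式"]
vectors = {"三角形": [0.1, 0.2], "面積": [0.3, 0.4], "方程式": [0.5, 0.6]}

doc = ["面積", "三角形", "面積"]
print(word2vec_representation(doc, dictionary, vectors, dim=2))
```

Because the layout follows the dictionary order rather than the token order, every document maps to a vector of the same length regardless of how many tokens it contains.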
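To make the layer sizes concrete, the following sketch traces the feature-map shape through the two 3 × 3 convolutions and the 2 × 2 max pooling. It assumes stride 1 and no padding for the convolutions, and a sequence length of 20 tokens in the example; none of these three values is stated in the text, so the resulting number is illustrative only.

```python
def conv2d_valid(height, width, kernel=3):
    """Output size of a stride-1 convolution without padding (assumed)."""
    return height - kernel + 1, width - kernel + 1

def max_pool(height, width, pool=2):
    """Output size of non-overlapping max pooling."""
    return height // pool, width // pool

def flattened_size(seq_len, embed_dim=250, filters=100):
    """Number of values fed to the 250-node fully-connected layer."""
    h, w = seq_len, embed_dim          # embedding output: seq_len x embed_dim
    h, w = conv2d_valid(h, w)          # first 3 x 3 convolution
    h, w = conv2d_valid(h, w)          # second 3 x 3 convolution
    h, w = max_pool(h, w)              # 2 x 2 max pooling
    return h * w * filters

print(flattened_size(seq_len=20))
```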

C. PART 3. CLASSIFICATION
After the features are extracted and the feature vector R is generated from the previous process, the document will be classified into categories. The classical concept of classification is as follows.
$C$ is the set of classes, and $d$ is the true class of the data. SVM is utilized as our classification model, which is implemented with the scikit-learn library. We are given a set of training data with $m$ points of the form $(r_1, y_1), \ldots, (r_m, y_m)$, where $r_i$ is a vector that represents the data and $y_i$ indicates the class that $r_i$ belongs to. The goal of SVM is to find a hyperplane that separates the data into the classes to which they belong.
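For reference, the standard textbook soft-margin formulation for one binary subproblem is given below. It is not taken from the source: the feature map $\phi$, the slack variables $\xi_i$, the regularization weight $\lambda$ (written as $\lambda$ to avoid a clash with the class set $C$), and the kernel width $\gamma$ are notation introduced here, with $y_i \in \{-1, +1\}$.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^{2} + \lambda \sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \quad y_i \left( w^{\top} \phi(r_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m

K(r_i, r_j) = \phi(r_i)^{\top} \phi(r_j) = \exp\left( -\gamma \lVert r_i - r_j \rVert^{2} \right)
```

The second line is the RBF kernel. For more than two classes, scikit-learn's `SVC` combines such binary problems with a one-vs-one scheme.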

D. RETRAINING MECHANISM
FIGURE 4: The architecture of our Convolutional Neural Network model. We utilize a 2D convolutional layer to learn a higher level feature, and a max pooling layer to reduce the computation.
Due to the ever-growing volume of learning materials, the classification model needs to be updated with the latest dataset to maintain the best-suited classification. Our model provides a retraining mechanism to keep the classifier up to date. Generally, the input questions are classified by the trained model. After the newly classified learning materials reach a certain quantity, the model automatically initiates the retraining mechanism and is tuned by the new data. The mechanism works as follows. First, a threshold is defined to determine whether retraining is necessary. After the size of the new input data exceeds the threshold, the retraining mechanism is automatically triggered and initiates the training process of the model with the latest dataset. Note that the classifier is retrained completely on all of the updated data, because SVM cannot be partially retrained. Through this retraining mechanism, the model can constantly improve with the growing dataset.
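The threshold-triggered retraining described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the class name `RetrainingClassifier`, the `train_fn` callback, and the majority-class stand-in for the CNN-SVM trainer are all hypothetical, and the sketch stores the predicted label of each new question as its label, which the text does not specify.

```python
from collections import Counter

class RetrainingClassifier:
    """Classify incoming items and retrain on all data once enough new items arrive."""

    def __init__(self, train_fn, initial_data, threshold):
        self.train_fn = train_fn          # builds a model from (features, label) pairs
        self.data = list(initial_data)    # full dataset used for every retraining
        self.threshold = threshold
        self.pending = []                 # new items classified since the last training
        self.model = train_fn(self.data)
        self.retrain_count = 0

    def classify(self, features):
        label = self.model(features)
        self.pending.append((features, label))
        if len(self.pending) > self.threshold:   # size of new data exceeds threshold
            self._retrain()
        return label

    def _retrain(self):
        # Full retraining on old plus new data, since the SVM is not updated partially.
        self.data.extend(self.pending)
        self.pending = []
        self.model = self.train_fn(self.data)
        self.retrain_count += 1

def train_majority(data):
    """Stand-in trainer: always predicts the most frequent label."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda features: majority

clf = RetrainingClassifier(train_majority, [([0], "algebra")], threshold=2)
for features in ([1], [2], [3]):
    clf.classify(features)
print(clf.retrain_count, len(clf.data))
```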
Algorithm 1: Algorithm of the retraining mechanism.

IV. EXPERIMENTS
Our dataset contains 1400 mathematical questions, some of which include an image; the total number of images is 600. The questions fall into 13 categories. Each result is obtained by running the experiment 10 times and computing the average. To test the performance boost of the multi-modal preprocessing methods, the classification result is tested starting from text only, then adding the mathematical expression feature extraction, and finally integrating the image features with the text.

A. DATA SETTING
The dataset is divided into two parts: 80 percent for training and 20 percent for testing. To avoid overfitting of our CNN model, in the CNN stage we further split the training data into two parts: 10 percent for validation and 90 percent for training. After that, the training data is fed into the SVM to train the classifier. After the training process, the testing data is used to evaluate the performance of our trained model.
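The two-stage split can be sketched as below. The helper name `split`, the fixed seed, and the use of a plain random shuffle are assumptions; the text does not say whether the split is stratified by category.

```python
import random

def split(items, first_fraction, seed=0):
    """Shuffle a copy of the items and cut it into two parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * first_fraction)
    return items[:cut], items[cut:]

questions = list(range(1400))                    # stand-in for the 1400 questions
train_all, test = split(questions, 0.8)          # 80% training, 20% testing
cnn_train, cnn_val = split(train_all, 0.9)       # CNN stage: 90% training, 10% validation

print(len(train_all), len(test), len(cnn_train), len(cnn_val))
```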

B. HYPERPARAMETERS
We are interested in identifying the combination of parameters that results in the best performance of our model. Through experiments, we have found that the dropout rate, embedding dropout rate, hidden dimension, and embedding dimension affect the performance the most. Other parameters, such as the batch size and learning rate, are set to 32 and 0.0001 by default during training. First, we conduct a grid search on the embedding dimension and hidden layer dimension, and the results are shown in Fig. 5 and Fig. 6. As shown in the figures, the embedding dimension and hidden layer dimension seem to perform best around 250 and 300, respectively. To identify the best combination of the embedding dimension and hidden layer dimension, we vary the embedding dimension from 150 to 350 and the hidden layer dimension from 200 to 400 in a double-parameter grid search, as shown in Fig. 9. As can be observed in the figure, the best combination of the embedding dimension and hidden layer dimension is (300, 250). In addition, another grid search is conducted on the CNN dropout and embedding dropout rates, as shown in Fig. 8. As can be seen in the figure, the best combination of the CNN dropout and embedding dropout rates is (0.5, 0.3) with the rbf kernel and the Adam optimizer.
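A double-parameter grid search of this kind can be sketched as follows. The step size of 50 is an assumption, and `toy_score` is a placeholder objective that exists only so the sketch runs; it does not reproduce any measured accuracy. In practice, `score_fn` would train the model with the given parameters and return its validation accuracy.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate every parameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(embedding_dim, hidden_dim):
    """Placeholder objective (not a real accuracy measurement)."""
    return -abs(embedding_dim - 300) - abs(hidden_dim - 250)

grid = {
    "embedding_dim": range(150, 351, 50),   # 150 to 350, assumed step of 50
    "hidden_dim": range(200, 401, 50),      # 200 to 400, assumed step of 50
}
best, _ = grid_search(toy_score, grid)
print(best)
```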

C. BASELINE METHODS
Our model is compared with the state-of-the-art model, LEAM [11]. LEAM performs text classification using both label embedding and word embedding to obtain word-label attention. In addition, we are also interested in how classic text classification models perform on our question classification problem. Therefore, we also include CNN and Bi-LSTM as our baseline methods. The CNN classifier has the same network topology as our CNN feature extraction module. LSTM is a well-known neural network in the field of natural language processing [4]. The Bi-LSTM utilizes a bi-directional LSTM. In the Bi-LSTM classifier, the number of Bi-LSTM cells is set to 50, followed by a 1D max pooling layer and a dense layer for classification. Both the CNN and the Bi-LSTM have an embedding layer after the word sequence input $T$.

D. EVALUATION METRIC
We evaluate the classifier's performance by calculating the accuracy score (notations are defined in Table 1):
$$\mathrm{Accuracy} = \frac{TP + TN}{N},$$
where "true" means that the document is predicted correctly, "positive" means that the document is predicted to belong to the positive class, and vice versa.
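For a single-label multi-class problem, this score amounts to the fraction of documents whose predicted label equals the true label. A minimal sketch, with the hypothetical helper names `accuracy` and `mean_accuracy` (the latter averages over repeated runs) and made-up toy labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_accuracy(runs):
    """Average accuracy over several (y_true, y_pred) runs."""
    return sum(accuracy(t, p) for t, p in runs) / len(runs)

# Toy labels for illustration only.
y_true = ["algebra", "geometry", "ratio", "algebra"]
y_pred = ["algebra", "geometry", "algebra", "algebra"]
print(accuracy(y_true, y_pred))
```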

E. EXPERIMENT RESULT
In Table 2(a), it can be seen that the CNN-SVM hybrid model outperforms the state-of-the-art text classification method, LEAM [11]. In addition, the CNN-SVM hybrid model outperforms Bi-LSTM by almost 10% and CNN by 15% in terms of accuracy. This indicates that our method is more suitable for the question classification dataset. The SVM is good at finding a general optimal solution for the learned Word2Vec feature space. Combining SVM with CNN takes advantage of both models. Table 2(b) shows the performance of the CNN-SVM hybrid model with different feature processing methods. The experiment result shows that text is the most important feature in our model, and that the mathematical expression features help our model perform better. Our image processing method does not provide a significant performance improvement in our experiment. This is because the keywords obtained from the images in the learning materials are likely to appear in the text description of the questions as well. Additionally, images that carry general mathematical meaning are not that common in the K-12 learning materials, as many images merely illustrate the scenario of the questions.
After checking the misclassified documents, we find that most of the mistakes occur between labels that are highly correlated, such as two sub-concepts of single-variable algebra, or proportion questions with two variables versus multiple variables. In addition, many misclassified documents are ambiguous between the predicted label and the true label. Moreover, in some of the misclassified documents the predicted labels are actually the correct ones because the original labels are incorrect. Hence, our method correctly classifies the document most of the time, as long as the document is not ambiguous among multiple highly correlated labels.
The above experiments show that the combination of the multi-modal preprocessing, Word2Vec, and the hybrid model can improve the performance of the classification. Our model can reduce the human labeling effort and eliminate subjective mistakes.

V. CONCLUSION
In this work, we introduce a hybrid CNN-SVM for the classification of mathematical learning materials in question-driven learning scenarios. Our proposed model can handle learning materials with multiple modalities, such as text and images, simultaneously. Furthermore, we extract the mathematical formulas in the text as features for classification. The result shows that, although combining multiple modalities does not improve the performance of the classification, our feature extraction method for mathematical expressions does.
Our model can accurately classify the data into chapters and knowledge sets. The experiments show that our hybrid CNN-SVM model is effective and that it outperforms the CNN and Bi-LSTM models. The model can also deal with the cross-topic labeling situation and retrain itself as the data grows.

VI. FUTURE WORK
We believe that in the future, this classification model can be utilized in other subjects, such as English, science, and language questions. Moreover, by integrating it with student learning portfolios, the model can be extended to a recommendation system for learning. It can analyze the students' learning process, and provide feedback for students. The system can track the weaknesses of the students and supply them with the suitable learning materials to improve their learning performance. The recommendation system can turn into a customized personal learning assistant that is specialized for every unique student.
This system can help not only students but also instructors. Instructors can manage and modify the lessons using the data collected from the students. For example, if the system can recommend learning materials for a knowledge set given by the instructor, it offers a more efficient way to schedule the content of a quiz.
We can further apply semantic analysis to the learning materials. Once the machine understands the questions, it can provide corresponding answers to them. Hence, we can extend the classification model into a question-answering system for K-12 students. With the combination of the recommendation system and the question-answering system, we believe that it can offer significant assistance to students in their learning progress.