ETFPOS-IDF: A Novel Term Weighting Scheme for Examination Question Classification Based on Bloom’s Taxonomy

Numerous earlier studies focused on term weighting schemes to increase examination question classification accuracy based on Bloom's Taxonomy (BT). When determining the cognitive level of an examination question, not all the terms present in the question are equally significant. Verbs are the most important parts of speech when assigning weights to the terms. However, two types of verbs may be present in the questions: BT verbs and supporting verbs. BT verbs have a higher impact on determining the cognitive level of a question than supporting verbs. Nevertheless, the schemes proposed in past studies assigned equal weight to both types of verbs. Therefore, this study introduces the term weighting scheme ETFPOS-IDF, which assigns BT verbs a higher weight than supporting verbs. The BT verbs were identified based on their position in the questions. Three datasets and three classifiers: Support Vector Machine, Artificial Neural Network, and Random Forest, were used in this study. Two evaluation metrics: accuracy and F1 score, were used to evaluate the performance of the proposed model. The experiment results showed that the proposed ETFPOS-IDF outperformed all the schemes introduced by earlier studies in examination question classification and achieved 0.749 in accuracy and 0.746 in F1 score. The findings of this study demonstrate that distinguishing between different verb types is significant in reducing the misclassification of examination questions. This research contributes a novel term weighting scheme for classifying examination questions based on BT. Future work may involve identifying the optimal weight for both types of verbs, evaluating the proposed scheme with a larger dataset, and comparing the performance with deep learning.


I. INTRODUCTION
Educational data mining extracts valuable information from the raw data generated by educational systems [1]. Recent years have witnessed the rise of educational data mining applications. These applications involve predicting student performance [2] and motivation [3], student modeling [4], student behavior modeling [5], and many more. In addition, predicting the cognitive level of examination questions [6] is one of the educational data mining applications and the focus of this study.

(The associate editor coordinating the review of this manuscript and approving it for publication was Zijian Zhang.)
Bloom's Taxonomy (BT) is a framework used in educational institutions to produce examination questions of various cognitive levels. Benjamin Bloom, an American educational psychologist, proposed this framework in 1956 [7]. The cognitive domain of BT consists of six levels, as shown in Fig. 1. These levels are ordered from low- to high-order thinking, with Knowledge being the easiest and Evaluation the most complex. Every examination question falls into one of these cognitive levels. However, using this framework to label the questions manually is time-consuming [8], [9], [10]. So, many past studies worked on examination question classification to automate the process using machine learning (ML) [8], [11], [12], deep learning [10], and rule-based [13], [14] classification techniques. However, rule-based classification is not practical since new rules must be established whenever new data is added. In the classification of examination questions, there is no publicly accessible large-volume dataset, and labeling the questions requires a thorough understanding of the BT cognitive domain. So, building a large dataset is challenging and requires considerable time. Hence, this work focused on ML-based classification by introducing a novel term weighting scheme, considering that deep learning models often require a large amount of data to perform effectively.

(VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
The past studies on ML-based examination question classification tried to decrease the classification error by working on feature selection, feature extraction, and term weighting. Still, there is room to increase classification accuracy through term weighting, since only a few studies have worked on term weighting in examination question classification using BT. Term weighting is an approach to assigning numerical weights to the terms present in a document. Term weighting is unavoidable since text cannot be fed directly into a machine learning classifier to train and test the model. These numerical values represent the significance of the terms in the classification; the higher the value, the more important the term.
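To make the idea concrete, a minimal sketch of the classic TF-IDF weighting follows. The toy corpus and the exact TF and IDF formulas here are illustrative assumptions, not the variant used in this study:

```python
import math

def tf_idf(term, question, corpus):
    # Term frequency: occurrences of the term relative to the question length.
    tf = question.count(term) / len(question)
    # Inverse document frequency: terms appearing in fewer questions
    # receive a higher weight.
    df = sum(1 for q in corpus if term in q)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["define", "osi", "model"],
    ["compare", "osi", "and", "tcp"],
    ["evaluate", "routing", "protocols"],
]

# 'osi' appears in two of three questions, so it scores lower than
# 'define', which appears in only one.
print(tf_idf("define", corpus[0], corpus))
print(tf_idf("osi", corpus[0], corpus))
```

The higher score for "define" illustrates the principle the rest of the paper builds on: weights encode how informative a term is for the classifier.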
An examination question may contain BT verbs, supporting verbs, nouns, adjectives, adverbs, and many more. These parts of speech (POS) are not equally significant when labeling or classifying the questions according to BT. When categorizing the questions, verbs are the most important among the POS [11]. However, two types of verbs, BT verbs and supporting verbs, may be present in a question. BT verbs are more significant in determining the cognitive level of questions than supporting verbs since there is a strong link between BT verbs and BT levels. Fig. 1 shows some BT verbs with their corresponding cognitive levels. The past studies [11], [15] that emphasized verbs while weighting the terms did not distinguish between these two categories of verbs. However, distinguishing between them may increase classification accuracy. So, this study proposed a new weighting scheme by distinguishing between the two categories of verbs mentioned earlier.

II. RELATED WORK
This section divides the past studies of examination question classification into works on feature selection, feature set extraction, and term weighting. Table 1 shows the past studies of examination question classification on feature selection and feature set extraction, whereas the term weighting studies are in Table 2.

A. WORK ON FEATURE SELECTION AND FEATURE SET EXTRACTION
To reduce the feature space for the Artificial Neural Network (ANN), [16] investigated a few feature selection methods, such as document frequency (DF) and category frequency-document frequency (CF-DF). The experiment result showed DF as a suitable feature reduction method in classifying examination questions. In contrast, this study found CF-DF inappropriate since it attempts to exclude verbs that could exist at more than one cognitive level. Reference [17] investigated the effectiveness of term frequency (TF) as a feature selection method and found that a threshold of two or more is the optimal value for TF. According to [18], most past studies focused on bag-of-words (BOW) and syntactic features. So, this work introduced several features: question keywords, the headword or BT verb, and syntactic and semantic features. However, this research did not investigate the effects of these features in classifying examination questions.
Reference [19] tested multiple feature selection methods: Chi-Square, Mutual Information, and Odds Ratio in combination with various classifiers. These classifiers are Support Vector Machine (SVM), Naïve Bayes (NB), and k-Nearest Neighbour (KNN). The experiment results of this study showed the superiority of Mutual Information with a weighted feature size of 250. Reference [20] investigated whether individual linguistically motivated features or their combinations can increase classification accuracy. These features are Unigrams, Bigrams, Trigrams, POS Bigrams, POS Trigrams, and Word/POS Pairs. The experiment results of this work showed that Unigrams outperformed all other features in the single feature set experiment. The combination of Unigrams and Bigrams achieved the highest result with the logistic regression (LR) classifier in the combined feature set test. Reference [9] extracted and tested different forms of TF-IDF: Words TF-IDF, N-Gram TF-IDF, and Characters TF-IDF with the NB classifier. The outcome of this study showed that the N-Gram TF-IDF outperformed the other forms of TF-IDF.

B. WORK ON TERM WEIGHTING
To solve the overlapping keyword problem of BT, [13] introduced category weighting for the conflicting category in rule-based classification. In this work, subject matter experts assigned the weights based on the question category. Another rule-based study [22] introduced category weighting for examination questions and assigned the weights based on the highest path and lemma similarities. A further rule-based study [23] used category weighting to classify the questions; however, it used WordNet and cosine similarity to assign weight to the question category.
Some past studies [20], [21] of machine learning-based examination question classification utilized the standard TF-IDF as a term weighting scheme. Reference [24] mentioned that TF and TF-IDF work well in situations where the words are repetitive. However, this is not the case in examination question classification, since questions contain fewer terms. So, [24] applied binary term weighting instead of TF and TF-IDF. Not all the words in a question are equally significant when classifying it. So, [15] introduced enhanced term frequency-inverse document frequency (ETF-IDF) and assigned a higher weight to the verbs present in the question compared to the other POS. Nouns and adjectives received a higher weight than the remaining POS. The outcome of this study showed that ETF-IDF outperformed traditional TF-IDF. In a later study [11], the same authors came up with a new scheme called TFPOS-IDF. TFPOS-IDF assigned the highest weight to the verbs, followed by the nouns and adjectives. The experiment result showed that TFPOS-IDF outperformed TF-IDF. However, this work performed no comparison between ETF-IDF and TFPOS-IDF.

C. RESEARCH GAP IN TERM WEIGHTING
From the above discussion, it is observable that the schemes ETF-IDF and TFPOS-IDF assigned the highest weight to the verbs, followed by the nouns and adjectives. However, there could be more than one type of verb in questions: supporting and BT. The differences between these verb types are explained below with the help of an examination question.
Sample Question: ''Suggest any (2) efforts that the organization may perform to discourage unethical behavior.'' In the above sample question, the word 'suggest' at the beginning of the question is a BT verb. Two more verbs are also in the question: 'perform' and 'discourage.' However, these are supporting verbs. The schemes ETF-IDF and TFPOS-IDF did not distinguish between BT verbs and supporting verbs; they assigned equal weight to all the verbs, whether BT or supporting. However, discriminating between the different types of verbs may increase classification accuracy, since the BT verbs present in a question have a more substantial impact on determining its cognitive level than the supporting verbs. Nevertheless, no past studies addressed this issue during term weighting. Therefore, this study aims to introduce a novel term weighting scheme by identifying the BT verbs in the questions and assigning them a higher weight than the supporting verbs.

III. METHODOLOGY
A. DATASET
This research utilized three datasets from earlier studies to train and test the examination question classification model. Pedagogy experts have already labeled these datasets according to the cognitive level of BT. The first dataset was introduced by [24] and consisted of 181 questions. Reference [24] also introduced the second dataset, comprising 415 questions from multiple fields such as Computing, Social Science, Business, and many more. The third dataset used in this study was introduced by [17] and consisted of 600 questions. However, many questions of the third dataset do not contain BT verbs. So, filtering was applied to remove the questions that do not have at least one BT verb; after discarding them, 387 questions remained in the dataset. Though a few studies [9], [25] categorized the BT levels into low and high order, this study used the six levels of the cognitive domain as class labels in the target variable for classification.
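The filtering step described above can be sketched as follows. The verb list here is a small illustrative excerpt, not the actual BT verbs database collected from [24]:

```python
# Small excerpt standing in for the BT verbs database from [24].
BT_VERBS = {"define", "list", "explain", "compare", "apply", "design",
            "suggest", "evaluate"}

def has_bt_verb(question):
    # Keep a question only if at least one token matches a known BT verb.
    tokens = (tok.strip(".,;:?!()") for tok in question.lower().split())
    return any(tok in BT_VERBS for tok in tokens)

dataset = [
    "Define the OSI model.",
    "What is a router?",          # no BT verb, so it is discarded
    "Compare TCP and UDP.",
]
filtered = [q for q in dataset if has_bt_verb(q)]
print(filtered)
```

Applied to the third dataset, this kind of filter is what reduced the 600 questions to the 387 that contain at least one BT verb.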

B. PREPROCESSING
Preprocessing the text data involves many steps. These steps include the elimination of punctuation and stopwords, tokenization, stemming or lemmatization, POS tagging, and many more. Many past studies [19], [20], [22] used these methods to preprocess the examination questions. Besides these, the BT verbs in every question need to be identified, since a different weight needs to be assigned to the BT verbs than to the supporting verbs in the proposed term weighting scheme.
In this study, the questions were first converted into lowercase. The punctuation present in the questions was removed, and the questions were tokenized. After that, POS tagging was applied to the terms using the Stanford tagger (version 4.2.0) [26], following past studies [11], [15], [27]. The BT verbs need to be identified for later use in the proposed scheme of this study. So, all the BT verbs were determined by their position in the questions, as shown in Table 3. The detailed process of identifying the BT verbs is discussed in section III-C2.a. After identifying the BT verbs, stop words were removed, followed by lemmatization of the terms. The stop word list of NLTK (version 3.6.1) [28] was used in this study, and the WordNetLemmatizer was used for lemmatization.
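A simplified stand-in for the first preprocessing steps is sketched below using only the standard library. POS tagging, BT verb identification, and lemmatization are omitted here, and the stop word list is a tiny placeholder for NLTK's list:

```python
import string

# Placeholder stop word list; the study used NLTK's list (version 3.6.1).
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "that", "may", "any"}

def preprocess(question):
    # 1) lowercase, 2) remove punctuation, 3) tokenize, 4) drop stop words.
    lowered = question.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Suggest any (2) efforts that the organization may perform."))
```

Note that in the actual pipeline, stop word removal happens only after the BT verbs have been identified, so that positional cues such as 'and' are still available to the identification step.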

C. FEATURE EXTRACTION 1) FEATURE SET
Before calculating the term weighting values of all the terms present in a question, there should be a feature set. This study used unigrams to obtain a feature set containing all the unique terms present in the dataset. Past studies of examination question classification, especially those [11], [15] that worked on term weighting, also used unigrams.
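Building a unigram feature set amounts to collecting every unique token across the dataset, as in this small sketch:

```python
def unigram_feature_set(questions):
    # The feature set is every unique term (unigram) in the dataset.
    return sorted({token for question in questions for token in question})

tokenized = [["define", "osi", "model"], ["compare", "osi", "model", "tcp"]]
print(unigram_feature_set(tokenized))
```

Each question is then represented as a vector over this feature set, with the term weighting scheme supplying the vector's values.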

2) TERM WEIGHTING
This study implemented three schemes proposed by past studies to compare the performance of the proposed scheme with past schemes of examination question classification. These schemes are TF-IDF, ETF-IDF [15], and TFPOS-IDF [11]. Among these three, ETF-IDF and TFPOS-IDF are the two latest schemes proposed in examination question classification. The standard TF-IDF has many variations. This study used the most optimal variant of TF-IDF identified by [29].

a: PROPOSED TERM WEIGHTING SCHEME ETFPOS-IDF
The proposed scheme ETFPOS-IDF is an enhanced version of the TFPOS-IDF proposed by [11]. TFPOS-IDF discriminates between the POS and assigns a higher weight to the verbs. However, TFPOS-IDF does not differentiate between the different types of verbs, i.e., BT verbs and supporting verbs. The proposed scheme of this study discriminated between the types of verbs and assigned higher weights to the BT verbs than to the supporting verbs. The ETFPOS-IDF is discussed in (1) to (3).
Ew_pos(t) = { w1, if t is a BT verb; w2, if t is a supporting verb; w3, if t is a noun or an adjective; w4, otherwise }    (1)

where w1 = 5, w2 = 3, w3 = 2, and w4 = 1. So, in (1), the BT verbs were assigned a weight value of 5, whereas the supporting verbs received 3. The BT verbs were identified by their position in the questions, as shown in Table 3 earlier.
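The weight assignment in (1) can be sketched as a small lookup. The tag names used here (BT, VERB, NOUN, ADJ, OTHER) are illustrative stand-ins for the tagger's actual tag set, and the noun/adjective tier follows TFPOS-IDF's ordering:

```python
def ew_pos(tag, w1=5, w2=3, w3=2, w4=1):
    # Weights as stated in the text: BT verbs 5, supporting verbs 3,
    # nouns/adjectives 2, all remaining POS 1.
    if tag == "BT":
        return w1
    if tag == "VERB":        # supporting verb
        return w2
    if tag in ("NOUN", "ADJ"):
        return w3
    return w4

tags = {"suggest": "BT", "perform": "VERB", "organization": "NOUN",
        "unethical": "ADJ", "any": "OTHER"}
print({term: ew_pos(tag) for term, tag in tags.items()})
```

For the sample question from section II-C, 'suggest' thus receives 5 while the supporting verb 'perform' receives only 3, which is exactly the discrimination the earlier schemes lacked.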
We analyzed the questions from all three datasets to identify the BT verbs. From the questions, we found some patterns to identify the BT verbs, as demonstrated in Table 3. There was no other way to identify the BT verbs except by analyzing the positions of the verbs in the questions. Therefore, we identified all the BT verbs in the questions using the BT verbs' positions. This process was very tedious and time-consuming.

Algorithm 1 Identifying the BT Verbs
 6: for sentence in sentences do
 7:     words ← split sentence
 8:     x ← first word of the sentence
 9:     if x is in d then
10:         newlist.insert((x, "BT"))
11:     else
12:         newlist.insert((x, "non-BT"))
13:     end if
14:     for word in remaining words do
15:         if word is in d then
16:             y ← previous word
17:             if y == "and" then
18:                 newlist.insert((word, "BT"))
19:             else if y is (Adverb & endswith "ly") then
20:                 newlist.insert((word, "BT"))
21:             else
22:                 newlist.insert((word, "non-BT"))
23:             end if
24:         else
25:             newlist.insert((word, "non-BT"))
26:         end if
27:     end for
28: end for
29: return newlist
30: end function
The implementation to identify the BT verbs is illustrated in Algorithm 1. At first, the questions were split into sentences based on commas, semicolons, and full stops. After that, every sentence was split into words. The first word of every sentence was searched in the BT verbs database to determine whether it was a BT verb. We collected the BT verbs database from past research [24]. The first word was added to the list with the label BT if it was a BT verb; if not, it was labeled non-BT. Each of the remaining words was labeled BT if the word was present in the BT verbs database and the previous word was 'and' or an adverb ending with 'ly'; otherwise, it was labeled non-BT. After that, we manually examined each dataset's output to ensure that the BT verbs had been appropriately identified.
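The steps above can be sketched in Python as follows. This is a minimal rendering of the published pseudocode, assuming pre-split sentences; the `adverbs` set stands in for the POS tagger's adverb detection, and `d` is the BT verbs database:

```python
def identify_bt_verbs(sentences, d, adverbs=frozenset()):
    # `d` is the BT verbs database; `adverbs` contains known adverb tokens.
    newlist = []
    for sentence in sentences:
        words = sentence.lower().split()
        if not words:
            continue
        x = words[0]                                   # first word of the sentence
        newlist.append((x, "BT" if x in d else "non-BT"))
        for i, word in enumerate(words[1:], start=1):
            if word in d:
                y = words[i - 1]                       # previous word
                # "list and explain ..." and "briefly describe ..." keep the
                # conjoined/qualified verb labeled as a BT verb.
                if y == "and" or (y in adverbs and y.endswith("ly")):
                    newlist.append((word, "BT"))
                else:
                    newlist.append((word, "non-BT"))
            else:
                newlist.append((word, "non-BT"))
    return newlist

d = {"suggest", "list", "explain", "describe"}
out = identify_bt_verbs(["list and explain two protocols",
                         "briefly describe routing"], d, adverbs={"briefly"})
print([word for word, label in out if label == "BT"])
```

Note how 'explain' is kept as a BT verb because it follows 'and', and 'describe' because it follows the adverb 'briefly', matching the positional patterns of Table 3.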
Finally, we compared the aforementioned algorithm's output with the POS-tagged preprocessed questions from the preprocessing phase described in section III-B to replace the label of the non-BT words with their POS. So, in this process, the BT verbs kept the label BT, whereas the label of each non-BT word was replaced with its POS. After that, stop word removal and lemmatization were performed, as mentioned earlier in the preprocessing stage.
The calculated Ew_pos(t) from (1) was used to calculate ETFPOS(t, q), as shown in (2):

ETFPOS(t, q) = Ew_pos(t) × C(t, q) / Σ_i C(t_i, q)    (2)

where C(t, q) represents the frequency of t in question q and Σ_i C(t_i, q) is the total number of terms in question q. Finally, ETFPOS-IDF(t, q) was calculated using (3):

ETFPOS-IDF(t, q) = ETFPOS(t, q) × IDF(t)    (3)

where IDF(t) is the inverse document frequency of t.
The normalization technique prevents numerical complexity of calculation during the model training process, as stated by [11]. This study normalized the weighting values of the proposed scheme ETFPOS-IDF using the L2 normalization technique. As a result, all the weighting values were converted to between 0 and 1. TFPOS-IDF was also normalized, following [11]. The normalized term weighting values were obtained using (4):

W(t, q) = ETFPOS-IDF(t, q) / sqrt(Σ_i ETFPOS-IDF(t_i, q)²)    (4)

In (4), ETFPOS-IDF(t, q) is the term weighting value obtained for t in question q.
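Putting (2) to (4) together, a compact sketch follows. The corpus, the per-term Ew_pos weights, and the log(N/df) form of IDF are illustrative assumptions here:

```python
import math

def etfpos(term, weights, question):
    # (2): Ew_pos(t) * C(t, q) / sum_i C(t_i, q)
    return weights[term] * question.count(term) / len(question)

def idf(term, corpus):
    # Assumed IDF variant: log(N / df(t)).
    df = sum(1 for q in corpus if term in q)
    return math.log(len(corpus) / df)

def etfpos_idf_l2(question, weights, corpus):
    # (3): per-term ETFPOS-IDF, then (4): L2 normalization of the vector.
    raw = {t: etfpos(t, weights, question) * idf(t, corpus)
           for t in set(question)}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {t: v / norm for t, v in raw.items()}

corpus = [["suggest", "effort", "organization"],
          ["explain", "organization", "behavior"],
          ["describe", "network"]]
weights = {"suggest": 5, "effort": 2, "organization": 2,
           "explain": 5, "behavior": 2, "describe": 5, "network": 2}
vec = etfpos_idf_l2(corpus[0], weights, corpus)
print(vec)
```

After L2 normalization, the squared values of each question vector sum to one, and the BT verb 'suggest' dominates the vector, as the scheme intends.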

D. CLASSIFICATION AND EVALUATION
This study used three well-known machine learning classifiers: SVM, Random Forest (RF), and ANN. The extensively used [11], [15], [24] Python module Scikit-learn (version 1.0.1) [30] was used in this study to train and test the classifiers. SVM: SVM was introduced by [31] in machine learning to solve classification problems. SVM has been widely used in text and examination question classification [19], [21], [32]. The past studies [15], [27] of examination question classification used the linear kernel of SVM, which is also known for high accuracy in text classification [33]. Hence, this study used the linear kernel of SVM with the default settings of Scikit-learn.
RF: RF is one of the most effective classifiers for text classification [34], introduced by Leo Breiman [35]. Several past studies [36], [37], [38] used RF for text classification purposes. RF is an ensemble classifier based on decision trees, and it uses the majority voting technique to determine the final predicted class [34]. One of the advantages of RF is that it can handle the overfitting issue [39], which was an issue in the decision tree classifier. This study used the Scikit-learn implementation of RF with the default settings and the random state as 42 for the reproducible results.
ANN: This classifier, also known as Multilayer Perceptron, was used in many past studies [40], [41] of text classification. ANN consists of the input, hidden, and output layers. There could be more than one hidden layer in ANN. However, this study used the default setting for the number of hidden layers and neurons of the ANN classifier available in Scikit-learn. As a random state for ANN, zero was used to achieve reproducible results. Besides this, 'lbfgs' was used as the solver since it converges faster with small datasets, according to the Scikit-learn [30] documentation of ANN.
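The classifier settings described above can be sketched with Scikit-learn as follows; this mirrors the stated configuration (linear-kernel SVM, RF with seed 42, MLP with seed 0 and the 'lbfgs' solver) and otherwise keeps the library defaults:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Linear-kernel SVM with Scikit-learn defaults otherwise.
svm = SVC(kernel="linear")
# RF with default settings and a fixed seed for reproducible results.
rf = RandomForestClassifier(random_state=42)
# ANN (MLP) with the default hidden-layer setting, seed 0, and the
# 'lbfgs' solver, which converges faster on small datasets.
ann = MLPClassifier(random_state=0, solver="lbfgs")
print(svm.kernel, rf.random_state, ann.solver)
```

Each classifier is then trained on the term-weighted question vectors produced by the schemes compared in this study.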
Evaluation Metrics and Cross-validation: This study used accuracy and F1 score as evaluation metrics to measure the performance of the proposed model. Many past studies [17], [21], [32] of examination question classification used these metrics. To split the dataset into training and test sets, we used the stratified k-fold cross-validation technique with the random state as 0. Stratified cross-validation ensures that each fold contains approximately the same proportion of data points from each class label present in the dataset [42]. This study adopted the approach of [27] of using multiple k-values to achieve more reliable results. As k-values, a range from 3 to 10 was used in incremental order. The final value was determined by calculating the mean for each k-value and then the mean over all k-values.
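The multiple-k evaluation procedure can be sketched as below. Synthetic data stands in for the actual question vectors, and note that StratifiedKFold only uses its random state when shuffling is enabled:

```python
from statistics import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data for the term-weighted question vectors.
X, y = make_classification(n_samples=120, n_classes=3, n_informative=5,
                           random_state=0)

# Mean accuracy for each k in 3..10, then the mean over all k-values.
per_k = []
for k in range(3, 11):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    per_k.append(mean(cross_val_score(SVC(kernel="linear"), X, y, cv=cv)))
final_accuracy = mean(per_k)
print(round(final_accuracy, 3))
```

Averaging over eight different fold counts smooths out the variance a single k would introduce, which is the motivation given in [27].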

IV. RESULTS AND DISCUSSION
A. EXPERIMENT RESULTS OF SVM
Table 4 presents the experiment results of the proposed ETFPOS-IDF along with the other schemes with the SVM classifier.

B. EXPERIMENT RESULTS OF ANN
Table 5 demonstrates the experiment results of the proposed ETFPOS-IDF along with the other schemes with the ANN classifier. From the results, the proposed scheme ETFPOS-IDF outperformed all the schemes in all the datasets used in this study. However, in Dataset 2, the difference in performance between ETFPOS-IDF and ETF-IDF is identical in both metrics, approximately 1.5%. Comparing the average performance on each dataset with the ANN classifier, Fig. 4 shows that the proposed term weighting scheme ETFPOS-IDF outperformed all the other schemes and achieved an average accuracy of 0.761 and an average F1 score of 0.759. As with the SVM classifier, the proposed scheme ETFPOS-IDF shows improvement in classifying examination questions based on BT with ANN.

C. EXPERIMENT RESULTS OF RF
Table 6 illustrates the experiment results of the RF classifier for all the datasets used in this research. The results show that in Datasets 1 and 3, the proposed ETFPOS-IDF outperformed all the other schemes. However, in Dataset 2, TFPOS-IDF surpassed all, including the proposed ETFPOS-IDF. In Dataset 1, the difference in performance between TFPOS-IDF and ETFPOS-IDF is minimal, 0.5% in both metrics. However, in Dataset 2, there is a considerable difference between these two schemes, 2.3% in accuracy and 2.4% in F1 score. From Fig. 5, we can see that overall, the proposed ETFPOS-IDF outperformed all the schemes, although the performance of TFPOS-IDF is nearly identical to ETFPOS-IDF.

D. SUMMARY OF THE RESULTS
Fig. 6 illustrates the summary of the results, obtained by averaging the results of all classifiers and datasets used in this study. From Fig. 6, it is observable that the proposed scheme ETFPOS-IDF outperformed all other schemes. The ETFPOS-IDF outperformed the closest-performing TFPOS-IDF by approximately 1.9% in accuracy and 2.1% in F1 score. Comparing ETF-IDF and ETFPOS-IDF, ETFPOS-IDF outperformed ETF-IDF by around 2.6% and 2.9% in accuracy and F1 score, respectively. The difference is even higher with TF-IDF, approximately 5.2% and 5.7% in accuracy and F1 score, respectively.

E. DISCUSSION
From the experiment results, we found that the proposed scheme ETFPOS-IDF outperformed the other schemes of examination question classification proposed by previous studies, namely ETF-IDF, TFPOS-IDF, and standard TF-IDF. In ETF-IDF and TFPOS-IDF, higher weights were assigned to the verbs compared to the other POS, as verbs are more significant than any other POS in determining the cognitive levels of examination questions. However, there was no discrimination between the supporting verbs and BT verbs when weighting the terms in ETF-IDF and TFPOS-IDF. Table 7 presents the weighting values of all the schemes for a question taken randomly from Dataset 3. The table shows that the difference between the BT and supporting verbs is higher in the proposed scheme ETFPOS-IDF compared to TF-IDF, ETF-IDF, and TFPOS-IDF. The reason is that in ETFPOS-IDF, the BT verbs received higher weights than the supporting verbs. The better performance of ETFPOS-IDF could result from discriminating between the types of verbs while weighting the terms.

V. CONCLUSION
This study proposed a new term weighting scheme, ETFPOS-IDF, for examination question classification based on BT. The proposed ETFPOS-IDF distinguished the types of verbs present in the questions and assigned a higher weight to the BT verbs than to the supporting verbs. This study used three datasets and three classifiers to investigate the effectiveness of the proposed scheme. The results of the proposed ETFPOS-IDF were compared with schemes devised by earlier studies: traditional TF-IDF, ETF-IDF, and TFPOS-IDF. Accuracy and F1 score, two widely used classification evaluation metrics, were used in this study to evaluate the results. The results showed that the proposed scheme outperformed all other schemes with the SVM, ANN, and RF classifiers. This outcome indicated that distinguishing the types of verbs while weighting the terms significantly increases accuracy in classifying examination questions. Later studies may utilize a larger dataset to evaluate the stability of the proposed scheme. Future studies can also identify the optimal weight difference between the types of verbs and compare the performance of the proposed scheme with deep learning approaches.