Bloom’s Learning Outcomes’ Automatic Classification Using LSTM and Pretrained Word Embeddings

Bloom’s taxonomy is a popular model for classifying educational learning objectives into learning levels across three domains: cognitive, affective and psychomotor. Each domain is further divided into levels. The cognitive domain includes the knowledge, comprehension, application, analysis, synthesis and evaluation levels. In educational institutions, designing course learning outcomes (CLOs) according to Bloom’s levels and mapping assessment items onto the designed CLOs is an important task. Every semester, faculty and administrators read thousands of statements to complete the tedious task of mapping CLOs and assessment items to Bloom’s levels in order to improve student learning. This paper proposes an LSTM-based deep learning model that classifies CLOs and assessment items into the levels of Bloom’s cognitive domain. Although there have been some attempts in the literature to assign Bloom’s taxonomy categories automatically using a keyword-based approach, that approach suffers from low accuracy and overlapping keywords. When we initially applied a keyword-based approach to our datasets, we achieved an overall accuracy of 55% for the classification of CLOs and assessment items into Bloom’s taxonomy. The proposed model predicts the Bloom’s level of a CLO or an assessment question item. It has a simpler architecture than other deep learning models reported in the literature and achieves classification accuracies of 87% and 74% on CLOs and assessment question items, respectively. The proposed model obtained a 3% increase in overall accuracy compared to an existing study on the same task. To the best of our knowledge, this is the first attempt to apply deep learning to the classification of educational objectives into Bloom’s levels.


I. INTRODUCTION
Thinking ability is considered the heart of all learning activities, without which no one can learn [1]. Every educational institution evaluates this thinking process through teaching, understanding, quality assessment and evaluation to ensure maximum student learning. First, teaching and understanding are carried out by teachers, who design the teaching material and a set of course learning outcomes (CLOs) focusing on students' thinking ability [2]. Second, quality assessment is done by accreditation bodies and regulatory organizations [3]. Finally, evaluation is done by conducting written examinations. Educational institutions, including teachers and accreditation bodies, need hierarchical levels to differentiate students' thinking behaviors during the learning process. These levels also help to clarify what the teacher is communicating and what the student is perceiving during the learning process [4]. In 1956, Benjamin Bloom and a group of educational psychologists developed a classification system of the thinking behaviors important in learning, namely "Bloom's Taxonomy" [5]. Moreover, Krathwohl et al. in [6] have defined this taxonomy as the "Taxonomy of Educational Objectives".
The associate editor coordinating the review of this manuscript and approving it for publication was Zijian Zhang.
Bloom's taxonomy classifies thinking behaviors into three domains: cognitive (related to mental behaviors), affective (related to emotional behaviors) and psychomotor (related to physical behaviors). Among the three, the cognitive domain has received the most attention due to its high applicability in educational institutions [7]. The cognitive domain is further divided into a hierarchy of six levels of thinking behavior involved in a student's learning process (see Figure 24 in the appendix). Each level has associated keywords / action verbs that differentiate it from the other levels (see Figure 2). Later, the cognitive domain was revised by other experts and the revised Bloom's taxonomy was proposed. The levels of this revised taxonomy are shown in Figure 1. For the rest of this study, any discussion of Bloom's taxonomy refers only to the cognitive domain of the revised taxonomy. Usually, the classification of course learning outcomes (CLOs) and questions into the levels of the cognitive domain is done manually by teachers and accreditation bodies according to their own domain understanding. This is time consuming and often leads to mistakes due to human bias. There is therefore a need to automate this process, a task that falls within the area of text classification. Indeed, several earlier research works attempted to automate this process using keyword searching, natural language processing and machine learning techniques [8]-[15].
Previously, a few studies [8]-[10] employed keyword-based approaches to classify questions into Bloom's taxonomy levels. Although the approach showed promising results, it suffers from one major weakness: several keywords overlap across more than one cognitive level of Bloom's taxonomy [16] (see Figure 2).
Consider the following example:
1) "Define the scope and importance of technical writing in academic and professional life"
2) "Define the basic principles and concepts as they relate to practical accounting problems"
These two CLO statements belong to two different levels: CLO1 belongs to the understanding level and CLO2 belongs to the remembering level. However, both contain the same action verb/keyword "Define".
Consider another example:
1) "In your own words, how would you define transferable skills"
2) "Define compound interest"
These two questions belong to two different classes: Q1 belongs to the understanding level and Q2 belongs to the remembering level. However, both question statements contain the same action verb/keyword "Define".
This is the major drawback of automatic keyword-based approaches for the classification of CLOs and questions. A human can easily assign the above CLOs or questions to their respective Bloom levels, but a machine using a simple keyword-based approach is error-prone. Therefore, recent studies [8], [17]-[19] employed alternative machine learning-based approaches to classify CLOs and questions. However, to the best of our knowledge, these studies still suffer from low accuracy [18], because they employed conventional ML approaches. In addition, none of the existing proposed models is used in practice, owing to their limited accuracy [13].
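The keyword overlap described above can be illustrated with a minimal sketch. The keyword lists below are hypothetical and heavily abbreviated (they are not the lists of Figure 2), and the function name `candidate_levels` is chosen only for this illustration. The point is that a pure lookup cannot separate the two "Define" CLOs.

```python
# Hypothetical, abbreviated verb lists; real Bloom keyword tables are much longer.
# "define" is listed under two levels on purpose, to mirror the overlap problem.
KEYWORDS = {
    "Remembering": {"define", "list", "name", "recall"},
    "Understanding": {"define", "explain", "summarize", "describe"},
    "Application": {"apply", "demonstrate", "solve"},
}


def candidate_levels(statement):
    """Return every level whose keyword set shares a word with the statement."""
    words = {w.strip(".,?!'\"").lower() for w in statement.split()}
    return sorted(level for level, verbs in KEYWORDS.items() if words & verbs)


clo_1 = "Define the scope and importance of technical writing in academic and professional life"
clo_2 = "Define the basic principles and concepts as they relate to practical accounting problems"

# Both statements match the same two levels, so the lookup alone cannot decide.
print(candidate_levels(clo_1))
print(candidate_levels(clo_2))
```

Under these assumed lists, both CLOs return the same two candidate levels, so a keyword-only system must either guess or use extra context.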
Hence, improved automated text classification approaches are needed to raise the performance of existing studies in this domain. Recently, deep learning has shown very promising results compared to traditional machine learning algorithms, especially in the area of text classification [20]. Another problem in this domain is the lack of datasets of CLOs and questions that are already tagged with cognitive levels. Most researchers working in this domain develop their own datasets, and those are not publicly available. Therefore, in this research project we have also developed our own dataset of CLOs classified into Bloom's taxonomy with the help of domain experts.
Our research contributions are:
1) The main contribution is an improved automatic model for classifying CLOs and exam questions into Bloom's taxonomy, utilizing the proposed deep LSTM model in combination with contextual domain embeddings.
2) Preparation of a dataset of academic course CLOs and questions, manually tagged by subject experts into one of six levels, namely Remembering, Understanding, Application, Analysis, Evaluation and Creating.
3) The proposed approach addresses the issue of overlapping keywords in Bloom's taxonomy levels.
4) Although the proposed classification model is a combination of existing techniques from the literature, it shows significant improvements on the given problem. The major contribution is towards the applied side of solving the problem of this research study.
The rest of the paper is organized as follows: Section II presents the literature review in the field of automatic classification into Bloom's taxonomy and introductory details about some recent approaches. Section III details the methodology used for the proposed system, its construction, working and evaluation. Section IV shows the experimental results of the proposed system in comparison with state-of-the-art models. Section V highlights major insights from the experimental results. Section VI concludes the research study and Section VII presents the future work.

II. LITERATURE REVIEW
This section explains major background studies related to Bloom's taxonomy in general, existing approaches for automatic classification of CLOs and questions into Bloom's taxonomy and overview of relevant techniques including text classification and deep learning.

A. BLOOM's TAXONOMY
Chang et al. in [2] explained that Bloom's taxonomy (cognitive domain) is actively used by educational institutions (i.e., teachers and accreditation bodies) to define course learning outcomes (CLOs), to design teaching material and to assess examination questions in order to find problems in students' learning. Krathwohl et al. in [6] introduced a revised version of the cognitive domain of Bloom's taxonomy (see Figure 1). The major difference is that, in the revised version, Evaluation is at the second-highest level as compared to the original taxonomy. Moreover, the new category Creating is at the top of the revised taxonomy. The revised taxonomy was developed to show the intersection and the different types of levels involved in learning. In this study, we have considered the revised Bloom's taxonomy for the classification of CLOs and examination questions.
Recently, Swart et al. in [21] explained the importance of using Bloom's taxonomy for understanding and classifying CLOs. The authors observed that the cognitive domain defines thinking behaviors ranging from simple memory recall to complex reasoning skills in terms of students' learning (see Figure 24). They analyzed the course learning outcomes (CLOs) of an Electronic Fundamental course offered at two universities, in Romania and South Africa, respectively. The results indicated that the first two levels of the cognitive domain (remembering and comprehension) together contributed 58% of the total CLOs, while application and the remaining levels contributed 27% and 15%, respectively. Also, Rahmatih et al. in [22] used Bloom's taxonomy to analyze students' questioning skills. The major takeaway from this study was that the questions were asked in the cognitive domain, which shows the importance of the cognitive domain in the education sector.
Moreover, Atiullah et al. in [23] used Bloom's taxonomy to evaluate the presence of higher-order thinking skills in reading comprehension questions. The authors collected a total of 158 reading comprehension questions from 15 texts of an English textbook for Grade X and classified all of the questions into Bloom's taxonomy using a manual, intuitive approach. The results indicated that the majority of these questions (i.e., 134) were categorized at the remembering level and only 24 of the 158 questions were categorized at higher levels (i.e., comprehension, application). It was concluded that the Grade X English textbook lacks higher-order thinking content, as 85% of the reading comprehension questions were below the comprehension level in Bloom's taxonomy. Hence, the above studies and discussions clearly explain the importance of using Bloom's taxonomy in education. Wijanarko et al. in [24] proposed the use of Bloom's taxonomy when generating questions from unstructured content. The idea was to evaluate the generated questions against the Bloom's taxonomy levels to check whether they satisfy the learning outcomes.

B. TEXT CLASSIFICATION
Text classification is becoming increasingly important due to the exponential growth of complex text generated on the Internet. Handling this text requires an in-depth understanding of machine learning algorithms that can automatically categorize text in many real-world applications. Most of the breakthroughs in text classification are due to the efficiency of recent techniques in understanding complex relationships in text data [14]. Moreover, text classification has been actively used and studied in applications like Information Retrieval [15], [25], [26] and Information Filtering [27], [28].

C. AUTOMATIC CLASSIFICATION OF CLOs AND EXAMINATION QUESTIONS INTO BLOOM's TAXONOMY
Automatic classification of educational objectives into Bloom's taxonomy can be defined as a task of classifying CLOs or exam questions from course material or examination papers, respectively. In previous studies, several researchers have tried to solve this problem using automatic ML-based and NLP-based techniques. The literature highlights two major approaches in this domain namely, Keyword-Based [1], [2], [29] and Text-Classification-Based [13], [19], [30]- [33].

1) KEYWORD-BASED APPROACH
Bengio et al. in [17] applied a keyword-based approach by searching for the keywords of each level (see Figure 2 in the Appendix for the list of available keywords). The keywords were searched for in different CLO and question statements for the sake of testing. Without using a machine learning / text classification approach, the authors obtained an accuracy of 75% for the Remembering level only. For the Evaluating level, the authors obtained only 25% accuracy. The average accuracy over all 6 cognitive domain levels was just 47%. Omar et al. in [1] applied an NLP-based technique to identify and use important keywords. After identification, the authors used a rule-based approach to determine the desired cognitive domain level. The authors applied this approach to 100 questions (70 training set and 30 test set items) and reported very low accuracy, as the training set was not large enough to learn the rules. In the keyword-based approach, the accuracy is good for the two basic levels (i.e., Remembering and Comprehension), because the CLO / question statements are very simple and straightforward and there is no issue of keyword overlap.
All of the studies above use the general keywords of Bloom's taxonomy (cognitive domain). Christian et al. in [34], however, prepared a list of new verbs for Bloom's taxonomy. The newly proposed verbs number 84 in total, of which 34 are technical verbs. They have not made the list public yet.

2) TEXT-CLASSIFICATION-BASED APPROACH
Zhang et al. in [19] applied machine learning algorithms to classify questions related to computing education into Bloom's taxonomy. The dataset comprised 504 questions, manually annotated by education experts. The authors reported a highest accuracy of 82% on the test set, with a training-to-test split of 90% to 10%.
Manjushree et al. in [35] applied deep learning models (CNN and LSTM) to classify assessment items into the cognitive levels of Bloom's taxonomy. The authors collected a dataset of 844 instances from a software engineering course and manually labelled it into six levels. Furthermore, the authors used a 70-30% train-test split to assess the performance of the CNN and LSTM models. The highest performance, 80% on the test set, was obtained with the CNN model. This is the most recent work done in this domain.
Mohammed Manal et al. in [36] applied three ML-based classifiers (KNN, Logistic Regression and SVM) with two feature engineering techniques (TF-IDF, Word2Vec) to classify questions into the cognitive domain of Bloom's taxonomy. The authors used two datasets. The first was manually collected and labelled by them, with 141 questions in six Bloom's taxonomy levels; the second was taken from the literature [31], with 600 questions divided into the same six levels. The authors achieved satisfactory results on both datasets using a train-test split approach. The average accuracy was 83.7% on the first dataset and 89.7% on the second. This was a significant improvement over the results achieved by [31] on the same dataset. Although the approach was promising, it is less generalizable and less scalable: as the amount of data increases, it becomes a bottleneck for this approach.
Hoeij et al. in [30] applied an ML-based SVM classifier to classify examination questions into Bloom's taxonomy. They achieved more than 80% accuracy because the dataset was very small and almost all the questions were written according to the keywords of the cognitive domain of Bloom's taxonomy. The major drawback of this study is that examination questions do not always contain these keywords. Osadi et al. in [13] combined multiple classifiers using an ensemble approach for question classification into Bloom's taxonomy. Since their dataset was very small, containing only 100 questions, they proposed to extend this approach to a larger dataset in future work.
Yahya et al. in [31] developed a classification model to classify short essay questions on Bloom's taxonomy cognitive domain on two veterinary courses. The authors achieved an overall accuracy of 65%. Yusof et al. in [32] proposed different machine learning based methods to classify question into cognitive domain levels. The authors experimented with different feature engineering and ML models combinations and achieved highest accuracy of 76% with SVM classifier. Furthermore, Zhang et al. in [33] proposed a technique called Category Frequency-Inverse Document Frequency (CF-IDF). The proposed method used ANN (Artificial Neural Networks) to utilize the frequency of each class label in order to classify questions. The authors achieved only 60% accuracy for first three levels of cognitive domain.

D. DEEP LEARNING
Deep learning algorithms and architectures have made excellent advances in fields like computer vision. Moreover, the growing use of deep learning for NLP can be observed in the articles indexed in WoS over the last 5 years (see Figure 3). In recent years, deep learning approaches that depend on dense vector representations have produced promising results on a variety of NLP tasks, including text classification [20]. The major reason for this success is the use of word embeddings [37], [38] and deep neural network architectures [39]. Deep learning models learn feature representations automatically from data, whereas in traditional shallow machine learning models the feature representation is based on hand-crafted features. Collobert et al. in [40] show that deep learning models outperform most of the traditional state-of-the-art approaches for text classification, named-entity recognition (NER) and part-of-speech (POS) tagging. Hence, deep learning models are increasingly being used in NLP problems like machine translation, sentiment classification and text generation [18].

1) WORD EMBEDDINGS
Word embeddings are most often the first data processing layer in deep learning models; they help the network learn feature representations automatically by converting raw text into vectors of continuous real numbers. These word embeddings, or dense vector representations, rest on the simple hypothesis that words with similar meanings tend to occur in the same contexts. The similarity between word vectors is measured using cosine similarity [41]. These embeddings are therefore fast and efficient at capturing context in most state-of-the-art core NLP tasks [42]-[44]. The representations are mainly learned from context in an unsupervised manner [18].
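As a small illustration of how cosine similarity compares word vectors, the following sketch uses three made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions, and the values and variable names here are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, not taken from any trained embedding).
vec_explain = [0.9, 0.1, 0.3]
vec_describe = [0.8, 0.2, 0.4]
vec_solve = [0.1, 0.9, 0.2]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(vec_explain, vec_describe))
print(cosine_similarity(vec_explain, vec_solve))
```

With these toy values, the first pair scores much higher than the second, which is the behaviour one would hope for from words used in similar contexts.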
The most efficient and actively used word embedding technique is word2vec, which comprises two models, namely the continuous-bag-of-words (CBOW) and skip-gram models [45]. Bengio et al. in [17] explained that once individual word representations are combined into a sentence representation using joint probabilities for word sequences, they yield an efficient representation for unseen sentences of the same context, because the network has already learned those representations. Moreover, word embeddings can be obtained from pre-trained embeddings such as GloVe [46], ELMo [47], BERT [48] and FastText [49]. These embeddings form the foundation of many current deep learning approaches in NLP.

2) RECURRENT NEURAL NETWORKS
The term "recurrent" refers to performing the same action/computation over a sequence of data (i.e., the sequence of tokens in text data). RNNs [50] are used for processing sequential information, where each computation is performed over a sequence of tokens and each computation depends on the previous computations and their results. In general, a fixed-size vector is created to represent the sequential information of the tokens; in this way, RNNs store information from previous computations for use in the current step. In recent years, RNNs have been widely used in major NLP tasks, namely language modeling [51], [52], machine translation [53], [54], speech recognition [55], [56] and image captioning [57]. RNNs suffer from the vanishing gradient problem [58], which makes it difficult to learn longer sequences.
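The recurrence described above can be sketched in a few lines. This is a generic textbook-style RNN step, h_t = tanh(W_x x_t + W_h h_{t-1} + b), written in plain Python; the weight values and the names `rnn_step` and `run_rnn` are assumptions made for the illustration and are not taken from the cited works.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t


def run_rnn(sequence, W_x, W_h, b):
    """Apply the same step to every token vector, carrying the hidden state."""
    h = [0.0] * len(b)  # fixed-size state summarising the tokens seen so far
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


# Toy weights (hypothetical): input size 2, hidden size 2.
W_x = [[0.5, -0.2], [0.1, 0.4]]
W_h = [[0.3, 0.0], [0.0, 0.3]]
b = [0.0, 0.0]

token_a, token_b = [1.0, 0.0], [0.0, 1.0]
print(run_rnn([token_a, token_b], W_x, W_h, b))
print(run_rnn([token_b, token_a], W_x, W_h, b))  # different order, different state
```

The two printed states differ, showing that the final hidden vector depends on token order and not only on which tokens occur.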
To solve this problem, specifically for NLP, different variants of the RNN are used, such as Long Short-Term Memory (LSTM) [59]. LSTM [60], [61] has an additional "forget" gate compared to the simple RNN model. This mechanism helps it overcome the vanishing gradient problem discussed above. The architecture consists of three gates, namely the input, forget and output gates, and the hidden state is calculated using a combination of these three gates. Other variants of the RNN are Gated Recurrent Units (GRUs) [62] and bidirectional LSTMs and GRUs [63]. The architecture of the LSTM neural network is explained in detail in Section III-G.
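To make the role of the three gates concrete, the following sketch implements one step of a standard LSTM cell with a single scalar unit. It follows the commonly used textbook formulation and may differ in notation from Section III-G; the parameter names (`w_i`, `u_i`, `b_i`, and so on) and their values are assumptions for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps parameter names to floats."""
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])    # input gate
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])    # output gate
    g_t = math.tanh(p["w_g"] * x_t + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c_t = f_t * c_prev + i_t * g_t   # keep part of the old memory, add new content
    h_t = o_t * math.tanh(c_t)       # hidden state exposed to the next layer
    return h_t, c_t


# Hypothetical parameters: every weight and bias set to 0.5.
names = ["w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
         "w_o", "u_o", "b_o", "w_g", "u_g", "b_g"]
params = {name: 0.5 for name in names}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state is copied almost unchanged from one step to the next, which is the mechanism that lets information and gradients survive over long sequences.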
Previous studies discussed in Section II-C showed very good results in automatic classification into Bloom's taxonomy. To the best of our knowledge, no study has used an approach based on recent word embeddings and deep learning models for this problem. Hence, in this research we have adapted these recent approaches to perform automatic classification of CLOs and questions into Bloom's taxonomy.

III. METHODOLOGY
A. OVERVIEW
This section presents the overall research methodology that we have used to classify a CLO or question into six distinct classes, namely "Remembering, Understanding, Application, Analysis, Evaluation and Creating". Figure 4 depicts the overall research methodology.
The methodology is logically divided into two components, namely 1) Domain Understanding and Data Acquisition, and 2) Construction of Proposed System. The details of each component are discussed in subsequent sections.

B. DOMAIN UNDERSTANDING
In order to understand the problem domain better, we conducted interviews with various domain experts. The major purpose of these interviews was to learn the different methods used in various departments, like computer science, electrical engineering and business administration, for categorizing CLOs and examination questions into Bloom's taxonomy. The domain experts included coordinators from international accreditation bodies like the Accreditation Board for Engineering and Technology (ABET) and the Association to Advance Collegiate Schools of Business (AACSB), HoDs and subject specialist faculty members from different universities. The experts were selected on the basis of their experience, because these are the people who are involved in the process of categorizing CLOs and examination questions and who can best explain the different ways of doing this activity. There were 10 participants in total, so we manually analyzed the responses to each question asked of the participants. The questions are listed in Appendix A. After reviewing all the interviews, we drew three major conclusions.
1) The categorization of CLOs and questions into Bloom's taxonomy is purely based on human understanding and is domain specific.
2) This is an important activity carried out in academic institutions for the assessment of course quality as well as examination paper quality, both needed to quantify students' learning.
3) If a single CLO or question statement contains a Bloom's keyword/action verb that overlaps different levels, then the neighbouring words are checked to differentiate the levels, as words are known through their company.

C. DATA ACQUISITION
As far as we know, there is no standard public dataset of course learning outcomes (CLOs) and questions tagged with Bloom's taxonomy levels. For this study, a manually tagged dataset from Sukkur IBA University is used. This creates a baseline for further experiments in this problem domain. Moreover, we requested a dataset of questions categorized into Bloom's taxonomy from faculty members of Najran University, Saudi Arabia. We use that dataset as a baseline as well to evaluate our proposed methodology, because the authors in [31], [36] have also used the same dataset for classification into Bloom's taxonomy. Usually, faculty members create CLOs in course description documents at the start of the semester and assign those CLOs to the questions asked in examination papers. The ABET or AACSB coordinator maps the CLOs to Bloom's taxonomy and checks whether the mapping is sufficient for maximum student learning [64]. For this study, we have used two datasets. Table 1 gives some important statistics for both datasets. For Dataset 1, a team from the Department of Quality Enhancement Cell (QEC) was asked to manually tag the compiled CLO statements with Bloom's taxonomy (cognitive domain) levels. To minimize error, the tagging was verified by faculty of the related courses from three departments (i.e., computer science, electrical engineering and business administration). Figure 5 shows the class-wise distribution of Dataset 1. Dataset 2, which we acquired from the researchers of an existing study, was already tagged with Bloom's taxonomy levels. Figure 6 shows the class-wise distribution of Dataset 2. We have used Dataset 1 to create a baseline for our proposed system and Dataset 2 to evaluate our proposed system against the existing study on the same dataset. Figure 7 shows the abstract model of the proposed system.
The proposed system accepts raw CLO / question text as input and classifies it into one of the Bloom's taxonomy levels in the cognitive domain. The proposed system performs the following tasks:
1) The first step is text pre-processing and cleaning, which converts the input text to lower case, removes stopwords and punctuation, and converts all words to their root forms using lemmatization.
2) Once the text is pre-processed, the next step is to compute numeric word vectors using a skip-gram based word embedding in order to represent the text as numeric features.
3) Finally, the Bloom's taxonomy level classifier assigns the text to one of the pre-defined categories.
Suppose the following raw CLO / question text is input: "Draw the flow chart of the PNK System". For this text, the following triple would be generated:
In the next few sections, we describe the construction and working of each of these components of the proposed system.

E. CONSTRUCTION OF PROPOSED SYSTEM
This section presents the construction of the three modules involved in the proposed system. As shown in Figure 7, the proposed system comprises three modules: Data Pre-processing and Cleaning, Learning Word Representation using Word Embedding, and Bloom's Taxonomy Level Classifier. Each module is discussed in detail in the sections below.

1) DATA PRE-PROCESSING AND CLEANING
Several studies have shown that data pre-processing improves classification results [45]. Therefore, we applied several pre-processing techniques to our collected datasets to remove non-informative features from the data. In pre-processing, we converted the text to lower case and removed punctuation and stop words using regular expressions and pattern matching. We also applied white-space tokenization and WordNet lemmatization to the pre-processed text. In tokenization, each question/CLO is split into tokens (words); the words are then converted to their root forms, such as "offended" to "offend", using the WordNet lemmatizer. Algorithm 1 shows the series of pre-processing steps applied to the raw datasets to obtain clean datasets as output. Figure 8 and Figure 9 show examples of applying the pre-processing steps to CLO and question data, respectively. After data pre-processing and cleaning, the next step in our proposed system is the preparation and splitting of the data so that it can be used for model construction and evaluation. Algorithm 2 lists the steps that we performed on Dataset 1 and Dataset 2 to prepare the training and test datasets for model construction and evaluation. The number of unique words and the maximum length were finalized after several experiments, which are reported in later sections (see Table 9 and Table 10 in the appendix). The test size giving the best performance was likewise selected after several experiments (see Figure 18).
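The pre-processing steps described above can be sketched as follows. This is only an approximation of Algorithm 1: the stop-word set is a tiny hypothetical subset, and a small lookup table stands in for the WordNet lemmatizer so that the example runs without external resources.

```python
import re

# Hypothetical, abbreviated stop-word list (a real list is much longer).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "are"}

# Stand-in for a WordNet lemmatizer: a tiny hand-written lookup table.
LEMMAS = {"offended": "offend", "charts": "chart", "systems": "system"}


def preprocess(text):
    """Lower-case, strip punctuation, tokenize on white space,
    drop stop words, and map each token to its (assumed) root form."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                  # white-space tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("Draw the flow chart of the PNK System"))
```

In a full pipeline, the lookup table would be replaced by a real lemmatizer and the stop-word set by a standard list; the order of operations shown here is one reasonable choice, not necessarily the exact order used in Algorithm 1.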

[Algorithm 2. OUTPUT: Training and Test Data Partition]
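Since the body of Algorithm 2 is not reproduced here, the following is only a plausible sketch of a data preparation step of this kind: limit the vocabulary to a number of unique words, encode and pad each text to a maximum length, and split the data into training and test partitions. The helper names, the reserved indices 0 and 1, and the default seed are all assumptions made for the illustration.

```python
import random
from collections import Counter


def build_vocab(token_lists, max_words):
    """Map the max_words most frequent tokens to integer ids.
    Index 0 is reserved for padding and 1 for out-of-vocabulary tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(tokens, vocab, max_len):
    """Turn tokens into ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(t, 1) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


def train_test_split(samples, labels, test_size, seed=0):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


docs = [["draw", "flow", "chart"], ["define", "compound", "interest"], ["draw", "graph"]]
vocab = build_vocab(docs, max_words=5)
encoded = [encode_and_pad(d, vocab, max_len=4) for d in docs]
print(encoded)
```

The vocabulary size, maximum length and test size are exactly the settings that the text says were tuned experimentally, so in practice they would be passed in as parameters, as done here.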

3) LEARNING WORD REPRESENTATION USING WORD EMBEDDING
One of the major features used in our proposed system is the semantic representation of words using word embeddings. Both of our datasets are small, and pre-trained word embeddings are usually used with small text datasets in deep learning [65]. Therefore, we decided to use pre-trained embeddings to learn efficient word representations for our datasets. We selected one of the recent pre-trained word embeddings, namely "Wiki Word Vectors". These pre-trained embeddings were developed by Facebook AI Research for 294 languages, including English, and were trained on Wikipedia text. The authors in [49] have explained these embeddings in detail. We discuss these details in Section III-E4. We selected this embedding for our task because we expected it to capture the semantic similarities of words better, for the following reasons:
• This embedding is trained on Wikipedia text using a technique that generates the representation of a word based on its neighbouring words.
• Most of the words in our datasets occur in the Wikipedia corpus.
Additionally, we experimented with other pre-trained embeddings, such as Glove.6B.100D and GoogleNews-vectors-negative300, to evaluate which is better for our task. The performance comparison of these pre-trained embeddings with our proposed pre-trained embedding is discussed in a later section.
VOLUME 9, 2021

4) PRE-TRAINED EMBEDDING "WIKI WORD VECTORS" FOR WORD REPRESENTATION
The pre-trained embedding used for this study was originally developed by a team of researchers from Facebook AI Research in 2017. The major motivation behind its development comes from research in the neural network community, where [66] proposed the use of a feed-forward neural network to learn a numeric representation of a word based on the occurrence of its left and right neighbouring words. This helps the network to build an understanding of words that occur together. The major issue with other pre-trained embeddings, like Glove.6B.100D and GoogleNews-vectors-negative300, is that although they are continuous word representations trained on large corpora, they ignore word morphology by assigning a distinct vector to each word. This creates a limitation for rare or out-of-vocabulary words that were not part of the training corpus. The model used to prepare the "Wiki Word Vectors" pre-trained embeddings is an extension of the original continuous skip-gram model.
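One way such an extension can take morphology into account, as we understand the fastText approach, is to represent a word by its character n-grams in addition to the word itself, so that rare or unseen words share pieces with known words. The sketch below only extracts the n-grams; the function name and the default range of 3 to 6 characters are stated here as assumptions, and no embedding vectors are computed.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))

# Related word forms share many n-grams, so an unseen form such as
# "classifying" can borrow information from a known form such as "classify".
shared = set(char_ngrams("classify")) & set(char_ngrams("classifying"))
print(sorted(shared)[:5])
```

Under this scheme a word vector can be built from the vectors of its n-grams, which is why out-of-vocabulary words still receive a meaningful representation.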
The initial skip-gram model proposed by Mikolov [49] is defined over a dictionary with a vocabulary of size K, where each word is identified by its index k, as defined in equation 1.
The first assumption of the skip-gram model is that a single word can be used to generate the neighbouring words that surround it in a text sequence. For example, take the text sequence ''student will demonstrate basic proficiency in computer commonly used computer applications''. We take ''demonstrate'' as the central target word and set the context window size to 2. As Figure 10 shows, the skip-gram model is interested in the conditional probabilities of generating the context words ''student'', ''will'', ''basic'' and ''proficiency'', which lie at a distance of no more than 2 words from the central target word ''demonstrate''.
However, there is not just a single central target word to consider. For the given text sequence, each word is treated in turn as the target word, and for each target word we need to calculate the conditional probabilities of its surrounding context words. Our input data therefore takes the form of equation 2, where K is the dictionary of words from the training data and k is an individual word. According to equation 2, with a context window size of 2 the text sequence becomes:
([will, demonstrate], student),
([student, demonstrate, basic], will),
([student, will, basic, proficiency], demonstrate),
([will, demonstrate, proficiency, in], basic),
The skip-gram model tries to learn the conditional probabilities of the context words for a given target word, assuming that all the context words are generated independently of each other. The skip-gram model is therefore interested in calculating the product of these conditional probabilities.
Once the (target, context words) tuples are formed for the complete training data, the next step is to predict and learn word representations for the context words with their respective target words using a neural network architecture. The neural network used for the skip-gram model is a simple shallow architecture with three layers: 1) an input layer, 2) a single hidden layer and 3) an output layer. The input layer is the one-hot-encoded version of the input target word, whereas the output layer is the probability function of the context words likely to appear with the input target word. In a neural network, each successive layer is built by computing the dot product of the previous layer with its weight matrix and adding a bias, which yields unnormalized scores (logits) that are then passed through an activation function such as softmax. Therefore, we can define the skip-gram neural architecture as in equation 3, where $w_k$ and $x_k$ denote the target word, and $b$ and $W$ are the bias and weight matrices, respectively. Also, $\mathrm{logit}(w_k)$ returns unnormalized scores at the output layer, so we need to apply the softmax activation at the output layer to normalize the probability scores, as defined in equation 4. The softmax activation function is calculated using equation 5, where n is the total number of (target, context words) tuples.
Figure 11 depicts the working of the skip-gram neural network for a training example built from the (target, context words) tuples below, with context window c = 2.
P(''student'', ''demonstrate'') · P(''will'', ''demonstrate'') · P(''basic'', ''demonstrate'') · P(''proficiency'', ''demonstrate'')
First, the input target word ''demonstrate'' is converted into a one-hot-encoded vector whose size equals that of the vocabulary dictionary, v = 8. Second, a weight matrix of size v × n is created, where n is the dimension of the word embedding; this matrix is denoted p. It represents each vocabulary word as a single row and is used as input to the hidden layer of size n. As c = 2, each training instance is fed forward 4 times, to 4 output vectors (i.e., $w_{k-2}$, $w_{k-1}$, $w_{k+1}$, $w_{k+2}$). The hidden layer is linked to the output layer through a second weight matrix, denoted p′, of dimension n × v. Initially, the weight matrices p and p′ are assigned random values, and their optimal values are learned by optimizing the network using back propagation and minimizing the loss according to the equation. This helps the neural network to learn more meaningful word representations for the context words with their respective target words.
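The forward pass described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the original implementation: the embedding dimension n = 4, the target index 2 and the random initial weights are hypothetical, the bias term is omitted for brevity, and the names `p` and `p_prime` simply stand for the two weight matrices.

```python
import math
import random


def softmax(scores):
    """Normalize unnormalized scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(target_index, p, p_prime):
    """Return softmax probabilities over the vocabulary for one target word."""
    # Multiplying a one-hot vector by p selects the row of the target word.
    hidden = p[target_index]
    n, v = len(hidden), len(p_prime[0])
    # Hidden layer (size n) times the n x v output weight matrix.
    scores = [sum(hidden[j] * p_prime[j][i] for j in range(n)) for i in range(v)]
    return softmax(scores)


random.seed(0)
v, n = 8, 4  # v = 8 as in the text; n = 4 is a toy value (the paper uses 300)
p = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(v)]
p_prime = [[random.uniform(-0.5, 0.5) for _ in range(v)] for _ in range(n)]
probs = skipgram_forward(2, p, p_prime)  # hypothetical index of the target word
```

With random weights the resulting distribution is close to uniform; training adjusts both matrices so that the true context words receive higher probability.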

F. LEARNING WORD REPRESENTATIONS USING PRE-TRAINED EMBEDDINGS FOR DATASET 1 AND DATASET 2
We computed word representations for both of our datasets using the pre-trained embedding explained in the section above. We obtained a 300-dimensional embedding matrix covering all unique words in our datasets. This embedding matrix serves as the weights of the embedding layer in our proposed classification model, which helps the proposed system to learn how different words appear in different contexts. Figure 12 shows the word embedding scores of some words from our embedding matrix, calculated as per the above method.
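As an illustration of how such an embedding matrix can be assembled, the sketch below maps each unique dataset word to its pre-trained vector. The names `build_embedding_matrix`, `word_index` and `pretrained` are hypothetical, the 3-dimensional toy vectors stand in for the 300-dimensional ones, and reserving row 0 for padding as well as using a zero vector for missing words are assumptions rather than details taken from the original work.

```python
def build_embedding_matrix(word_index, pretrained, dim):
    """Build a (len(word_index) + 1) x dim matrix of pre-trained vectors.

    word_index maps each unique word to a positive integer index;
    pretrained maps words to their pre-trained vectors.
    """
    # Row 0 is left as zeros (assumed to be reserved for padding).
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
        # Words without a pre-trained vector keep the zero row (assumption).
    return matrix


# Toy 3-dimensional vectors used purely for illustration.
pretrained = {"describe": [0.1, 0.2, 0.3], "compare": [0.4, 0.5, 0.6]}
word_index = {"describe": 1, "compare": 2, "unseenword": 3}
embedding_matrix = build_embedding_matrix(word_index, pretrained, dim=3)
```

The resulting matrix can then be supplied as the initial weights of an embedding layer, so that row i holds the vector of the word with index i.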
Furthermore, Figure 13 shows the first 40 unique words plotted according to their embedding matrix values. This helps to understand how well these embeddings are computed. Words that lie near each other in the visual are neighbouring words that occur together in the same context and represent a specific Bloom's taxonomy level, whereas word groups that lie far from each other represent different levels of Bloom's taxonomy.
For example, the words ''problem, describe, state, data'' mostly occur as neighbours of each other. These words share the same context and represent the ''Remembering'' level. The words ''following, compare'' also occur as neighbours of each other, so they share a context with each other, but that context differs from the context of the previous words. They therefore represent the ''Analysis'' level, which is different from the level represented by the previous words.

G. CONSTRUCTION OF BLOOM'S TAXONOMY LEVEL CLASSIFIER
Once the input data is ready in the form of its embeddings, the next step is to construct the classification model which classifies the input data into the different levels of Bloom's taxonomy. For our proposed classifier, LSTM has been chosen for its power in classifying sequences, and text is a classical example of a sequence. We have used a tagged dataset where the questions and CLOs are manually tagged as per their desired Bloom's taxonomy (cognitive level) categories. Therefore, our proposed classification model classifies a Question / CLO into the desired categories using an LSTM network.
We have already discussed RNNs in section II-D2, where we saw the problem of learning long-term dependencies in order to understand the context in sequential data. This problem is called the ''vanishing gradient'' problem [58]. There are cases in sequential data where we need longer sequences in order to understand the context effectively. Consider predicting the last word in the text ''I grew up in France. . . I speak fluent French''. The recent information from the words ''speak'' and ''fluent'' indicates that the last word must be the name of a language. But to predict which language, we need additional context reaching back to the name of the country, ''France''. Here, the gap between the predicted word and the required context word is very large. Practically, an RNN is not capable of solving these cases. This problem was identified by [67].
LSTM neural networks are a special kind of RNN which is capable of learning long-term sequence dependencies. These networks are specially designed to retain information for a long period of time. An LSTM has the ability to decide what information to keep and what to discard while processing input. It also has a gated mechanism to control the flow of the input sequences inside the LSTM cell. Before going into the details of the gating mechanism of LSTM, note that our proposed sequential neural network model initially processes an input sequence in which each word is represented as $w_1, w_2, \ldots, w_n$. Then, we have the word embedding layer, where the input words are combined with the 300-dimensional word vectors $w_v$ and the output is given to the LSTM neural network as given in equation 6. A single LSTM network cell is shown in Figure 14, and it is discussed step by step below.
The key element of the LSTM is its cell state. The cell state works like a bridge for information flow. As shown in Figure 14, it is like a horizontal line running through the whole LSTM cell with some linear interactions, where $C_t$ is the new cell state and $C_{t-1}$ is the old cell state. The output from equation 6 flows through this cell state.
Other important components of the LSTM network cell are its gates. These gates control the flow of information in different ways. As shown in Figure 14, the pink circle represents the pointwise multiplication operator and the yellow box represents the sigmoid neural network layer. This layer controls the information inside the gates: the sigmoid outputs a value between 0 and 1, where 0 means that nothing is let through and 1 means that everything is let through. Furthermore, the LSTM network has three types of gates: the forget gate, the input gate and the output gate. All of these gates are discussed below.
FIGURE 11. Sending $w_k$ = ''demonstrate'' through the neural network to calculate softmax probabilities for context words (''student'', ''will'', ''basic'', ''proficiency'').
The first step after the information enters the LSTM network is to decide which information from the previous output state should be excluded and which should be kept. This decision is made by the forget gate by looking at the previous output ($h_{t-1}$) and the current input ($x_t$). As shown in Figure 14, the yellow box represents the sigmoid layer, which outputs a value between 0 and 1 for each number in the cell state ($C_{t-1}$): 1 represents ''to consider the information'' and 0 represents ''to exclude the information'' from the network. The neural network equation for the forget gate is given in equation 7. The forget gate remains empty initially as there is no previous output state.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (7)$$
where $h_{t-1}$ is the output of the previous state and $x_t$ is the current input state. $W_f$ and $b_f$ are the weight and bias matrices of the forget gate, respectively.
Next, the LSTM network uses the input gate to decide what information from the current input needs to be added into the present cell state $C_t$. This input gate consists of two neural network layers, namely the sigmoid and tanh layers, as shown in Figure 14. The sigmoid layer decides which values will be updated and the tanh layer creates a vector of new candidate values, $\tilde{C}_t$, to be added into the present cell state $C_t$. The sigmoid and tanh neural network layer equations for the input gate are given in equation 8 and equation 9, respectively.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad (8)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \qquad (9)$$
where $h_{t-1}$ is the output of the previous state and $x_t$ is the current input state. $W_i$ and $b_i$ are the weight and bias matrices of the input gate sigmoid layer, respectively, and $W_C$ and $b_C$ are the weight and bias matrices of the input gate tanh layer, respectively. Once the vector of new candidate values $\tilde{C}_t$ is created, it is time to update the old cell state $C_{t-1}$ into the new cell state $C_t$, as shown in Figure 14. This is done by multiplying the output of equation 7 (i.e., $f_t$) with the previous cell state $C_{t-1}$. Next, the product of both input gate equations (i.e., equations 8 and 9) is added to it. The new cell state is given in equation 10.
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad (10)$$
where $f_t$ is the forget gate output, $C_{t-1}$ is the previous cell state, $i_t$ is the input gate output and $\tilde{C}_t$ is the vector of new candidate values. Finally, we need to decide what to generate as output. Here, the LSTM uses its last gate, ''the output gate''. This output gate decides which specific information from the cell state $C_t$ is sent as output. As shown in Figure 14, the sigmoid layer in the output gate decides which part of the information goes to the output, then the new cell state $C_t$ is passed through a tanh layer to bring its values between −1 and 1. Lastly, the result is multiplied by the output of the sigmoid layer so that only the selected information is output. The sigmoid and tanh output layers are given in equations 11 and 12, respectively.
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (11)$$
$$h_t = o_t * \tanh(C_t) \qquad (12)$$
where $h_{t-1}$ is the output of the previous state and $x_t$ is the current input state. $W_o$ and $b_o$ are the weight and bias matrices of the output gate sigmoid layer, respectively. Also, $C_t$ is the new cell state and $h_t$ is the output after the output gate. The output of the hidden layer $h_t$ lies between −1 and 1, and these values are not normalized. At the end, we need a probability distribution over the neurons of the output layer, whose number equals the number of class labels defined for classification. Therefore, the hidden state $h_t$ output by the LSTM layer is given to a densely connected output layer with a softmax activation function, which takes $h_t$ and converts it into normalized values forming a probability distribution over N possible outcomes. The class label with the maximum probability score is selected as the predicted label. The final output of our proposed LSTM neural network layer is given in equation 13.
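The gate computations walked through above can be sketched as a single LSTM time step in plain Python. This is a minimal sketch that follows the standard LSTM formulation and is not the original implementation: the helper names (`sigmoid`, `affine`, `lstm_step`, `random_params`), the toy sizes (hidden size 2, input size 3 instead of 300) and the random weights are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h_t and cell state C_t."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]  # forget gate
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]  # input gate
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], z)]  # candidates
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # new cell state
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]  # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # new hidden state
    return h_t, c_t


def random_params(hidden, inputs, seed=0):
    """Hypothetical small random weights and zero biases for all four layers."""
    rng = random.Random(seed)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params["W_" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + inputs)]
                               for _ in range(hidden)]
        params["b_" + gate] = [0.0] * hidden
    return params


hidden, inputs = 2, 3
params = random_params(hidden, inputs)
h, c = [0.0] * hidden, [0.0] * hidden  # no previous state at the first step
for x_t in ([0.1, 0.2, 0.3], [0.0, -0.1, 0.4]):  # two toy word vectors
    h, c = lstm_step(x_t, h, c, params)
```

In the classifier, the final hidden state `h` would be passed to the dense softmax output layer to obtain the class probabilities.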

H. PROPOSED LSTM BASED SEQUENTIAL MODEL FOR CLASSIFICATION
The input to the LSTM network layer in our proposed model is the preprocessed Question / CLO statements combined with the 300-dimensional pre-trained embedding from ''Wiki Word Vectors''. The output from the LSTM network layer is a probability distribution over up to six Bloom's taxonomy labels (Remembering, Understanding, Application, Analysis, Evaluation and Creating). Figures 15 and 16 show the summary of the proposed LSTM model for CLOs and Question classification.
As the size of the dataset is small, we have used a dropout rate of 0.2 at the LSTM layer in order to avoid model overfitting. We selected 0.2 as the dropout value because our proposed model performed best with it across several experiments. Moreover, in order to keep the learning efficient, we have used the ''Adam'' optimizer for the CLOs classification model and the ''RMS'' optimizer for the Questions classification model, because they usually work better for small datasets. Table 11 and Table 12 in the appendix show the process of selecting the best optimizer, dropout value and batch size. Again, we have used the typical deep learning trial-and-error process to fine-tune these hyperparameters. We have applied the ''Categorical Crossentropy'' loss function, which works well for multiclass classification with balanced/imbalanced class distributions.
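For reference, the categorical cross-entropy loss for a single example can be written out in a few lines. This sketch uses the standard definition of the loss; the function name, the clipping constant `eps` and the probability values are illustrative assumptions, not values from the experiments.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Loss for one example: -sum_i y_true[i] * log(y_pred[i])."""
    # Probabilities are clipped at eps to avoid log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot target over the six Bloom's levels and a hypothetical softmax output.
y_true = [0, 0, 1, 0, 0, 0]
y_pred = [0.05, 0.10, 0.70, 0.05, 0.05, 0.05]
loss = categorical_crossentropy(y_true, y_pred)
```

Because the target is one-hot, the loss reduces to the negative log of the probability assigned to the correct level, so confident correct predictions give a loss close to zero.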

1) PRECISION
It is the ratio of correctly predicted values for a specific class to all predicted values in that class.

2) RECALL
It is the ratio of correctly predicted values for a specific class to all actual values in that class.

3) F1-SCORE
It is the harmonic mean of precision and recall, and therefore a balanced ratio of both.

4) ACCURACY
It is the ratio of correctly classified instances to all instances.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (17)$$
We have used the average of class-wise accuracy as the major evaluation metric to evaluate the proposed classification model for CLOs and Questions into Bloom's Taxonomy. However, we have also discussed other evaluation metrics, including the confusion matrix.
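The four metrics can be computed per class from the true and predicted labels, as in the sketch below. The function name `per_class_metrics` and the short label lists are hypothetical and serve only to illustrate the definitions; they are not results from our datasets.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1-score and accuracy for one class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels: R = Remembering, U = Understanding, A = Application.
y_true = ["R", "R", "U", "U", "U", "A"]
y_pred = ["R", "U", "U", "U", "A", "A"]
metrics_u = per_class_metrics(y_true, y_pred, "U")
```

Averaging the per-class values with equal weights gives the macro-average, while weighting each class by its number of true instances gives the weighted-average.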

IV. EXPERIMENTAL RESULTS
This section presents the experimental settings and the detailed results obtained from the experiments performed in our work. We explain the different types of experiments performed to evaluate the proposed system.

A. USER DEFINED DATASET (DATASET 1)
The second component on which the proposed system depends is the set of CLOs tagged with Bloom's taxonomy (cognitive levels). Unfortunately, no benchmark dataset with ground truth values is available for this purpose. Therefore, we constructed a dataset of CLOs taken from Sukkur IBA University. The Quality Enhancement Cell (QEC) department manually tagged 828 CLOs as per their Bloom's taxonomy level. These domain experts have been conducting training sessions for faculty members on this tagging. The tagging was then verified by faculty members from four departments (Computer Science, Electrical Engineering, Business Administration and Mathematics) using an online application, where each faculty member logged in with their username and assigned a Bloom's level to the CLOs related to their courses. The CLOs were already available in their respective course outline / specification documents; however, we had to spend time compiling them into a single table and manually tagging them with their respective category. We used this dataset to create a baseline for upcoming research in this field. Section III-C shows some of the major statistics for both datasets.

B. BENCHMARK QUESTIONS DATASET (DATASET 2)
The proposed system depends considerably on the variety of questions tagged with Bloom's taxonomy (cognitive levels). The questions must cover the evaluation of student learning across Bloom's taxonomy levels as fully as possible. Therefore, we obtained such a pre-built dataset from faculty members of Najran University, Saudi Arabia. In this dataset, 600 questions are classified into the six levels of Bloom's taxonomy. We used this dataset as a baseline and evaluated the performance of our proposed system in terms of accuracy. Previous authors have reported classification accuracies of 84% in [31] and 89% in [36] using traditional machine learning approaches.

C. EXPERIMENTAL SETTINGS
Table 2 depicts the final experimental settings which we applied in order to get maximum accuracy. The accuracy measure is used here to evaluate the performance of both classification models. The model parameters mentioned in Table 2 are the final parameters for model training. To select the best number of epochs, we analyzed the overfitting and underfitting graphs given in Figure 19.

D. KEYWORDS BASED APPROACH RESULTS
This section explains the results of the initial keyword-based approach, which we applied in order to set the baseline results. Setting baseline results is important because they are used to evaluate the performance of the proposed model. We use this approach because it is the original approach used in the literature, and in practice, for Bloom's taxonomy classification. However, this approach suffers from the major problem explained in section I. To use this approach, we first built a dictionary of keywords/action verbs representing the six levels of Bloom's taxonomy. We extracted the major action verbs of each level of Bloom's taxonomy from different relevant sources. The list of these action verbs was shown previously in Figure 2.
Once the dictionary was built, we preprocessed the CLO and Question statements from our datasets and searched them for action verbs/keywords from the dictionary. The decision to assign a Bloom's taxonomy level to a CLO/Question was based on the following steps.
• If the text contains a single action verb with a maximum frequency equal to 1 and the verb belongs to only one Bloom's level, then that level is assigned to the text.
• If the text contains a single action verb with a maximum frequency equal to 1 and the verb belongs to more than one Bloom's level, then one of these levels is selected at random and assigned.
• If the text contains multiple action verbs with different frequencies, then the verb with the maximum frequency is used to obtain the Bloom's level, which is then assigned.
• If the text contains multiple action verbs with equal frequencies, then again one of the candidate Bloom's levels is selected at random and assigned.
The use of randomization is necessary here in order to make a decision, because when a single action verb belongs to multiple levels we simply cannot decide the exact Bloom's level without understanding its context. Once the Bloom's level is assigned, it is validated against the Bloom's level assigned by the domain experts. This approach is further explained in algorithm 3.
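The decision steps above can be sketched as follows. This is an illustrative sketch, not algorithm 3 itself: the small `VERB_LEVELS` dictionary is hypothetical (the actual dictionary was compiled from the sources behind Figure 2), whitespace splitting stands in for the preprocessing step, returning `None` when no action verb is found is an assumption, and so is the random choice when the most frequent verb itself belongs to several levels.

```python
import random
from collections import Counter

# Hypothetical action-verb dictionary: verb -> Bloom's levels it can indicate.
VERB_LEVELS = {
    "define": ["Remembering"],
    "explain": ["Understanding"],
    "compare": ["Analysis", "Evaluation"],
    "design": ["Creating"],
}


def assign_level(text, verb_levels=VERB_LEVELS, rng=random):
    """Assign a Bloom's level to a CLO/Question using action-verb frequencies."""
    counts = Counter(w for w in text.lower().split() if w in verb_levels)
    if not counts:
        return None  # no known action verb found (assumption)
    top = max(counts.values())
    # Verbs sharing the maximum frequency (a single verb in the simple cases).
    top_verbs = [verb for verb, c in counts.items() if c == top]
    candidates = sorted({lvl for verb in top_verbs for lvl in verb_levels[verb]})
    if len(candidates) == 1:
        return candidates[0]
    # Overlapping verb or tie between verbs: pick one candidate level at random.
    return rng.choice(candidates)


print(assign_level("define the term operating system"))
print(assign_level("explain paging and explain segmentation then define thrashing"))
```

The random choice in the last step is what makes this baseline sensitive to overlapping keywords, since the context that would disambiguate the level is never consulted.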
Moreover, the results obtained using this approach for Dataset 1 and Dataset 2 are shown in Table 3.
classification models for CLOs and questions using pre-trained wiki-word vectors. We applied two different approaches to evaluate the accuracy metric of the classification model, namely Train/Test split and Cross Validation. The subsequent sections explain both of these approaches.

1) TRAINING, TEST SET PARTITIONS
To begin the experiments, the manually tagged Dataset 1 (CLOs Dataset) was divided into different proportions of training and test sets. Initially, as a starting point, a ratio of 50:50 was used for the training and test sets. The model obtained an accuracy of 65% on this first proportion. To further assess whether this distribution has any impact on model training, we increased the training proportion gradually and found that our proposed model improves as it learns from more data. The highest accuracy of 74% was obtained with the proportion of 75:25. Beyond this proportion, the accuracy of the model started to decrease. We therefore set this distribution as the break point for the training and test proportions, because further increasing the training proportion gives no significant increase in accuracy.
Furthermore, we performed several experiments for Dataset 2 (Questions Dataset) as well. We again started with a 50:50 ratio for the training and test sets, on which the model obtained an accuracy of 72%. To check the impact of the training and test set proportions on model training, we gradually changed the proportions and observed that the model kept improving. Finally, we stopped at the proportion of 95:5 for the training and test sets, where the model reached its highest accuracy of 87%. This accuracy is 3% higher than that of the research study [31], where the authors performed the same task and reported a highest accuracy of 84%. However, Mohammed Manal et al. applied basic machine learning techniques to the same task on the same dataset in [36] and reported an accuracy of 89%, which is higher than both our proposed work and the result previously reported in [31]. This comparison should be read in light of two issues.
1) Generalization: In general terms, this is the ability of a trained model to perform well on unseen/new data. Usually, deep learning models need a large amount of data to achieve maximum generalization. Although the amount of data in our case was small, there is only a 2% difference between our accuracy and the accuracy reported by [36]. This suggests that our proposed model generalizes well even on a small amount of data. Also, the generalization achieved by a deep learning model is more effective than that achieved by traditional machine learning models due to benefits such as automatic feature learning, hierarchical layer architectures and the use of word embeddings [68].
2) Scalability: Usually, deep learning models need a large amount of data to perform well. This is one of the main bottlenecks for our proposed approach, and it explains why, on Dataset 2, we obtained lower accuracy than the approach proposed in [36].
In future, we can improve our proposed approach with more data, and we expect it to outperform the traditional machine learning models, because these models do not scale well with data. The accuracy, precision, recall and F1-score of our proposed approach for both datasets are shown in Table 4. For Dataset 1, we have shown the weighted-average of all the metrics because it is highly imbalanced. For Dataset 2, we have shown the macro-average of all the metrics due to its balanced nature. Furthermore, Figure 18 depicts the relation between the different training set proportions and the test set accuracy.
Initially, the model built for CLOs classification using Dataset 1, with the proportion of 75:25 for the training and test sets, was trained for 25 epochs. However, Figure 19 shows that there is no significant increase in accuracy after 8 epochs. Therefore, we trained the CLOs classification model for 8 epochs. Similarly, the model built for question classification using Dataset 2, with the proportion of 95:5 for the training and test sets, was trained for 30 epochs at the start. Figure 19 shows that the model obtained its highest accuracy at 20 epochs. Therefore, we trained the question classification model for 20 epochs only. Figure 19 shows the behaviour of model learning in terms of accuracy for different numbers of epochs.

2) K-FOLD CROSS VALIDATIONS
In this research study, we have tried to implement deep learning based classification models on small real-world datasets. Therefore, we have also evaluated their performance using k-fold cross validation in addition to train / test partitions. This technique randomly divides the dataset into K distinct chunks, where K-1 chunks are used to train the model and the remaining chunk is kept unseen in order to test the model performance using the accuracy metric. We applied K = 10 fold cross validation in order to understand the model behaviour over both small datasets (i.e., Dataset 1 and Dataset 2). Table 5 depicts the accuracy of both models, where we took the two highest accuracy values across the 10 folds and reported their average.
taxonomy (cognitive domain). Table 6 shows the details of this comparison analysis. The main objective of all these studies was to classify assessment/question items into Bloom's taxonomy (cognitive domain), which is the original objective of our research study as well.
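The K = 10 partitioning described under K-FOLD CROSS VALIDATIONS can be sketched with the standard library alone. The function names `k_fold_indices` and `mean_of_top`, the fixed shuffling seed and the interleaved fold assignment are assumptions for illustration; the sample count of 600 matches the size of Dataset 2.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division into chunks
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]  # the one chunk kept unseen
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean_of_top(accuracies, top=2):
    """Average of the `top` highest fold accuracies, as reported in Table 5."""
    best = sorted(accuracies, reverse=True)[:top]
    return sum(best) / len(best)


splits = list(k_fold_indices(600, k=10))
```

Each sample appears in exactly one test chunk, so every statement is used for testing once and for training in the other nine folds.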

G. OTHER EXPERIMENTAL RESULTS
As stated earlier, due to the small size of the data we cannot learn efficient word vector representations from the data itself. Therefore, we used the ''Wiki-Word-Vectors'' pre-trained embeddings in order to learn efficient, neighbourhood-based context word representations. Other research studies in the pre-trained embeddings domain also suggest the use of two well-known pre-trained embeddings, Glove.6B.300D and GoogleNews-vectors-negative300D. Therefore, in order to evaluate the performance of our proposed pre-trained embedding, we performed experiments with the same classification models and dataset distributions using the Glove.6B.300D and GoogleNews-vectors-negative300D pre-trained embeddings. Table 7 shows the classification performance of the proposed pre-trained embedding alongside the two other pre-trained embeddings. Furthermore, we know that for small amounts of data, traditional machine learning classifiers work very well most of the time. Therefore, we compared the performance of our proposed classification model with state-of-the-art machine learning algorithms used for text classification. We used four traditional machine learning algorithms (SVM, Multinomial Naive Bayes, Logistic Regression and Random Forest) and two deep learning networks (Dense and SimpleRNN). The dataset was distributed in the same proportion as for our proposed classification model. For the Dense and SimpleRNN networks, we used the same pre-trained embedding as in our proposed classification model. For the traditional machine learning algorithms, however, we used the TF-IDF approach for feature representation. Table 8 depicts the performance of all these other algorithms in terms of accuracy. The other models have performed very well but are still lower in accuracy than our proposed model. Figures 20 and 21 depict the overall results of the other experiments which we performed to evaluate the proposed model for CLOs and Question classification into Bloom's Taxonomy.
Figure 22 and Figure 23 show the confusion matrices of our proposed classification model for CLOs and Questions classification. Figure 22 shows the confusion matrix of the proposed LSTM model for CLOs classification. As shown there, out of 16 CLOs belonging to the Remembering class, 14 were correctly classified, and the remaining 2 instances were incorrectly classified into the Understanding class. The Understanding and Application classes contain the major proportion of the test set, with 71 and 54 instances, respectively. Out of these, 61 and 46 were correctly classified, respectively. The remaining 26 of the 167 test set instances belong to the Analysis and Evaluation classes. Here, the proposed model correctly classified 19 and 1 CLOs into their respective classes. The diagonal of the confusion matrix, which holds the correctly classified instances of each class, contains the largest counts, even though three of the five classes contain very few instances.

H. CONFUSION MATRIX FOR PROPOSED LSTM MODEL
Figure 23 shows the confusion matrix of the proposed LSTM model for Questions classification. As shown in the confusion matrix, the 30 test questions are divided over 6 classes. In three of the six classes, namely Remembering, Understanding and Application, 4 out of 5 questions are correctly classified. All 10 questions from the Analysis and Evaluation classes are correctly classified. However, the last class, Creating, shows the lowest class-wise performance, with only 3 out of 5 questions correctly classified.

V. DISCUSSION
In the experimental results, we have evaluated our proposed LSTM model for the classification of CLOs and Questions over the categories of Bloom's taxonomy. The experimental results showed that the proposed LSTM classification model in combination with the Wiki Word Vectors pre-trained word embeddings gives the best results compared to the other pre-trained embeddings and state-of-the-art text classification models. The theoretical analysis of the results is discussed in the subsequent sections.

A. PROPOSED PRE-TRAINED WORD EMBEDDING
The use of pre-trained embeddings for word representation is an efficient approach when the size of the data is relatively small. In this study, we compared our proposed pre-trained word embedding, namely ''Wiki-Word Vectors'', with two other pre-trained word embeddings, namely Glove.6B.300D and GoogleNews-vectors-negative300D. The experimental results showed that, of these three embeddings, our proposed embedding performed best, while the other two exhibited lower results. A possible reason behind the best performance of ''Wiki-Word Vectors'' is that it considers neighbouring words in order to understand context, so that words occurring in similar contexts receive similar representations. Moreover, it uses sub-word information by breaking words down to the character level in order to compute word representations for out-of-vocabulary words as well. The details are explained in section III-E4.
A possible reason behind the lower performance of the other two embeddings is their inability to handle unknown or out-of-vocabulary words, particularly in the academic domain. Although these embeddings are continuous word representations in the form of dense vectors, they do not take word morphology into account. Therefore, most of the out-of-vocabulary words cannot be computed from these embeddings and their representations become NULL. Overall, Table 7 and Figure 21 show the performance comparison of the three pre-trained embeddings.
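The sub-word mechanism mentioned above can be illustrated with a short sketch of character n-gram extraction. The boundary markers `<` and `>` and the n-gram lengths 3 to 6 follow the scheme commonly described for these sub-word embeddings, but they should be treated as assumptions here; the function names, the toy n-gram table and the averaging of n-gram vectors are hypothetical and do not reproduce the actual pre-trained model.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def oov_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Build a vector for an unseen word by averaging its known n-gram vectors."""
    found = [ngram_vectors[g] for g in char_ngrams(word, n_min, n_max)
             if g in ngram_vectors]
    if not found:
        return [0.0] * dim  # nothing known about this word
    return [sum(vec[j] for vec in found) / len(found) for j in range(dim)]


# Toy 2-dimensional n-gram table used purely for illustration.
ngram_table = {"<wh": [1.0, 0.0], "re>": [0.0, 1.0]}
print(char_ngrams("where", 3, 3))
print(oov_vector("where", ngram_table, dim=2, n_min=3, n_max=3))
```

An embedding that stores only one vector per whole word has no such fallback, which is why unseen academic terms end up without a representation in that case.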

B. PROPOSED LSTM CLASSIFICATION MODEL
Text classification using deep learning algorithms is a very active approach these days. Therefore, we employed a deep learning based LSTM network for our proposed classification model. In order to evaluate the performance of our proposed model, we compared it with other state-of-the-art machine learning as well as deep learning classification models. The experimental results showed that our proposed LSTM network achieved better performance for the classification of CLOs and Questions into Bloom's Taxonomy than the other state-of-the-art classifiers, namely SVM, Multinomial Naive Bayes, Logistic Regression, Random Forest, SimpleRNN and Dense Network. An interesting point is that the traditional classification algorithms which we employed (i.e., all of these except SimpleRNN and Dense Network) have shown reasonable accuracy, although still lower than our proposed LSTM model, whereas the SimpleRNN and Dense Network have shown very poor results. A possible reason is that dense or SimpleRNN network architectures are not efficient at learning longer text sequences. Our proposed model outperformed them possibly because LSTM can efficiently consider long text sequences for context understanding. The actual problem in this classification task is the understanding of context, because different categories of Bloom's taxonomy share the same action verbs. Only context understanding over longer sequences can solve this classification problem. In addition, the LSTM network has the ability to control different information using its gating mechanism, as explained in section III-G. Table 8 and Figure 21 depict the performance comparison of the six state-of-the-art classification models.

C. CLASS WISE PERFORMANCE FOR PROPOSED SYSTEM
We developed two variations of our proposed classification model, one with 5 classes (i.e., the CLOs classification model) and another with 6 classes (the Questions classification model), based on the manually tagged datasets (Dataset 1 and Dataset 2), respectively. As shown in Figure 22, the CLOs classification model performed very well for the first 4 classes (i.e., Remembering to Analysis), with a precision of 85% or higher. However, the last class (Evaluation) has a precision of 33%, which reduces the average performance of the model. The reason for this poor result is the lack of tagged CLOs for this level: the class contained only 10 instances in total, of which 7 were used for the training set and 3 for the test set according to the 75:25 training/test ratio.
The precision for this class, and the overall performance of the model, can be increased with the availability of more tagged data for this level. Another reason for the low performance is the imbalanced data across all 5 classes. Hence, fixing these issues might increase the overall average performance of the proposed model.
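The sensitivity of class-wise precision to very small classes can be seen in a short sketch. The labels below are hypothetical and are not taken from Dataset 1; the function simply computes, for each predicted class, the share of predictions that were correct.

```python
from collections import Counter
from typing import Dict, List


def per_class_precision(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Precision per class: correct predictions of a class / all predictions of that class."""
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / count for label, count in predicted.items()}


# Hypothetical test set with only three true Evaluation instances.
y_true = ["Analysis", "Analysis", "Analysis", "Analysis",
          "Evaluation", "Evaluation", "Evaluation"]
y_pred = ["Analysis", "Analysis", "Evaluation", "Evaluation",
          "Evaluation", "Analysis", "Analysis"]

for label, value in per_class_precision(y_true, y_pred).items():
    print(f"{label}: {value:.2f}")
```

With so few instances, a single additional correct or incorrect prediction shifts the precision of the rare class by a large step, so the estimate for that class should be read with caution.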

VI. LIMITATIONS
Although the proposed model demonstrated reasonable performance in classifying CLOs and question items into Bloom's taxonomy, it still suffers from various limitations. For example, Figure 23 shows that the CLOs classification model performed exceptionally well for 2 classes (i.e., Analysis and Evaluation), with a precision of 100%. For the first three classes (i.e., Remembering to Application), the model gives an overall precision of 80%, which is also quite satisfactory. However, the last class (Evaluation) gives a precision of 60%, which reduces the average performance of the model. The model is trained on CLOs and assessment items in the English language only; it will therefore be difficult for the model to predict the Bloom's taxonomy level of non-English statements. The proposed model is trained on CLOs and assessment items from different subjects, including computer science, electrical engineering and social sciences. For CLOs and assessment items from the medical and law domains, however, the model may perform poorly.

VII. CONCLUSION
The categorization of CLOs and Questions into Bloom's taxonomy is a purely domain expert task, because assigning a specific Bloom's taxonomy level in a way that maximizes student learning requires thorough understanding. The manual task is time consuming, laborious and often leads to mistakes due to human bias. In this research, an automatic classification system is proposed for classifying CLOs and Questions into Bloom's taxonomy using domain understanding. The categories used for classification were Remembering, Understanding, Application, Analysis, Evaluation and Creating. Our proposed model first learned the domain for each level from manually tagged datasets of CLOs and Questions labelled with Bloom's taxonomy levels. The model acquired this domain understanding using a skip-gram pre-trained embedding, namely ''Wiki-Word Vectors'', which takes into account the context of neighboring words. Once the model had acquired enough domain understanding, it classified the CLOs and Questions into specific categories using a single-layer LSTM model. The performance of the model was evaluated with the accuracy metric using a train/test split and 10-fold cross validation. We also evaluated the model by comparing it with two other pre-trained word embeddings and six state-of-the-art classification algorithms from the literature. In all these comparisons, our proposed domain based word embedding and LSTM model performed best. We obtained very encouraging results for CLOs classification (87%) and Question classification (74%), given that the dataset was relatively small, with only a few hundred instances. We also compared our model with an existing study, over which our proposed model achieved a 3% increase in accuracy for the same task.
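For readers unfamiliar with the evaluation protocol mentioned above, the following is a minimal sketch of how 10-fold cross validation partitions a dataset. The function names, the fixed seed and the `evaluate` callback are hypothetical; the sketch does not reproduce the exact splitting procedure used in our experiments, and it does not stratify by class.

```python
import random
from typing import Callable, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k (train, test) index pairs; each index is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n, k)  # the first `extra` folds receive one more item
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def cross_validate(n: int, k: int, evaluate: Callable[[List[int], List[int]], float]) -> float:
    """Average the score returned by `evaluate(train, test)` over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


folds = k_fold_indices(25, k=10)
print([len(test) for _, test in folds])
```

In practice, `evaluate` would train the classifier on the training indices and return its accuracy on the test indices; with imbalanced classes, a stratified split would be a natural refinement.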

VIII. FUTURE WORK
This work on the automatic classification of learning outcomes and questions into Bloom's taxonomy can be extended by developing domain-specific word embeddings from a large collection of CLOs and questions, which can be processed to build a dedicated skip-gram based word embedding. Various other neural network architectures, such as GRU, Deep Memory Networks or ensemble deep learning models, can also be evaluated in place of LSTM. Since this work is based on supervised classification, another direction is to develop a standard tagged dataset of CLOs and Questions with balanced classes and thousands of instances. This would make the training of deep learning classifiers more effective.
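As a minimal sketch of the first step in building such a skip-gram embedding, the function below generates the (target, context) training pairs from a tokenized statement. The window size and the example statement are assumptions made for illustration; training the embedding itself on these pairs is not shown.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair each token with every neighbor at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical CLO fragment, lower-cased and tokenized on whitespace.
statement = "analyze the time complexity of sorting algorithms".split()
for target, context in skipgram_pairs(statement, window=1):
    print(target, "->", context)
```

A skip-gram model is then trained to predict the context word from the target word, so verbs that appear in similar contexts across CLOs end up with similar vectors.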
Another natural language processing based approach could examine the role of metadata (e.g., length of the text) in classifying those keywords which overlap. A further extension may be the use of a voting classifier, because our dataset is tagged in three different ways: 1) human-labelled, 2) keywords-based-labelled and 3) machine-learning-model-labelled. We can therefore create a voting classifier on top of all these classifications. Finally, this classifier covers only the cognitive category, but there are two other categories of Bloom's taxonomy (i.e., Affective and Psychomotor). The same architecture can be used for these two categories by following the same tagged-dataset approach.
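The voting extension mentioned above can be sketched as a simple majority vote over the three label sources. The ordering of the sources and the tie-breaking rule (falling back to the first source, assumed here to be the human label) are assumptions for illustration, not a design that has been evaluated.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str], fallback_index: int = 0) -> str:
    """Return the label chosen by more than half of the sources.

    When no label has a strict majority, return the label from the source at
    `fallback_index` (assumed here to be the human annotation).
    """
    if not labels:
        raise ValueError("at least one label is required")
    label, count = Counter(labels).most_common(1)[0]
    if count > len(labels) / 2:
        return label
    return labels[fallback_index]


# Hypothetical labels in the order: human, keywords-based, machine-learning model.
print(majority_vote(["Analysis", "Application", "Application"]))
print(majority_vote(["Analysis", "Application", "Evaluation"]))
```

Weighting the sources by their measured reliability would be a natural refinement once agreement statistics between the three labelings are available.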