Student Modeling and Analysis in Adaptive Instructional Systems

There is a growing interest in developing and implementing adaptive instructional systems to improve, automate, and personalize student education. A necessary part of any such adaptive instructional system is a student model used to predict or analyze learner behavior and inform adaptation. To help inform researchers in this area, this paper presents a state-of-the-art review of 11 years of research (2010-2021) in student modeling, focusing on learner characteristics, learning indicators, and foundational aspects of dissimilar models. We mainly emphasize increased prediction accuracy when using multidimensional learner data to create multimodal models in real-world adaptive instructional systems. In addition, we discuss challenges inherent in real-world multimodal modeling, such as uncontrolled data collection environments leading to noisy data and data synchronization issues. Finally, we reinforce our findings and conclusions through an industry case study of an adaptive instructional system. In our study, we verify that adding multiple data modalities increases our model prediction accuracy from 53.3% to 69%. At the same time, the challenges encountered in our real-world case study, including an uncontrolled data collection environment with inevitably noisy data, call for synchronization and noise control strategies to ensure data quality and usability.


I. INTRODUCTION
Education has become increasingly concerned with the efficiency and effectiveness of the typical one-size-fits-all model that offers a common, rather than individualized, set of problem-solving instructions to every student in a class [1], [2]. One-size-fits-all models are ineffective because unique students have different educational factors, such as preferred learning styles, personality types, and cognitive potential. As such, a predominant issue emerging in learning technologies is how to best extract such personalized differences and provide appropriate customized support.
Adaptive Instructional Systems (AIS) are computer programs built on artificial intelligence, data mining, and learning analytics. AISs are implemented in educational settings as intelligent tutoring systems (ITSs), adaptive simulation systems, and serious games. These systems continue to be explored as a solution to the issue of personalized student education, with successful applications in schools, business, and government [3]-[5]. Despite consistent evidence of their effectiveness in improving student learning, AISs have as of yet failed to achieve major, widespread adoption. This failure can be attributed to high development costs [6], [7], as well as additional system constraints in both aligning the system to a curriculum and training instructors on proper use of the system [6], [8]. Lack of existing digital infrastructure can also limit both a researcher's ability to collect data and a student's ability to access the system [8]. However, it is our firm belief that recent and upcoming advances in technology will continue to make AISs easier to use, develop, implement, and modify for new curricula. Thus, the motivation for this study is to survey recent technological developments in two research areas that strengthen the ability of AISs: student modeling and analysis. For any AIS to successfully mimic human tutors in providing ''just-in-time'' or ''just-enough'' instructional support tailored to student needs, the AIS must acquire an accurate understanding of learners through a student model [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Laura Celentano.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Typically, student modeling is a systematic process of constructing mathematical student models that represent, analyze, and predict learner cognitive, behavioral, and/or affective states by exploiting digital traces of learners and their interactions with AISs. This process enables AISs to a) identify learner prior knowledge, b) isolate underlying misconceptions, c) predict learning goals and plans, and d) understand learner personalities and emotions.
The selection of learner characteristics and their associated learning indicators is critical for designing student models to fit the desired adaptation. In fact, categorization of learners can be based on static information that does not change during the learning process, as well as dynamic information that evolves with the process. The former is usually related to personal traits, such as age, gender, and cultural background. The latter encompasses several dimensions that can be measured directly or indirectly during learning. For instance, students often encounter various experiences that fail to match their expectations, such as encountering obstacles while working toward a goal, choosing equivalent alternatives, and even making mistakes. Consequently, learner emotions fluctuate with learning and often reach a state of confusion or frustration [10].
A perusal of the current literature reveals many student modeling approaches, which can be categorized into two major domains. The first line of work stems from the traditional overlay model [11], which presumes that the knowledge/skills of students are a subset of those of domain experts. The degree to which a learner knows a concept is often measured quantitatively as a probability [12], or qualitatively using fuzzy logic to address the high level of human uncertainty [13]. Variations are also seen in hybrid efforts that integrate overlay models with other methods, such as clustering [14], stereotypes [15], and ontology [16]. These developments sought to overcome the inability of overlay models to represent learners' partial and/or incorrect knowledge. With the advancement of artificial intelligence and sensor informatics, data-driven approaches constitute a second domain in student modeling. Rather than relying on expensive domain knowledge as a prior, they collect a significant quantity of measures about students before running computational methods to derive models. It is clear that student modeling methods vary in terms of the types of learner profiles and learning indicators used, but most importantly, they vary in terms of their limitations.
Although there are several reviews on similar aspects of student modeling in adaptive learning environments [17]-[19], more work continues to appear. Part of the reason for so much interest in the field is the complexity of the problem itself and the many challenges associated with it.
Even so, the increasing interest has so far led to few efforts to highlight multimodal learning analytics as a contributing factor in adaptive instructional systems. As such, this paper centers on student modeling and analysis developed in the past decade and makes the following contributions. First, this paper presents a state-of-the-art review, detailing how various student models have been implemented in AISs with respect to their pros, cons, and suitable learning contexts. Second, the paper categorizes multimodal data inputs and overviews successful implementations of multimodal data collection methods, ultimately leading to a discussion on the difficulty of in-classroom multimodal learning analytics (MMLA) implementations. Last but not least, the paper offers important perspectives and promising future directions in the field with empirical evidence of the benefits and challenges of MMLA through an industry case. The rest of the paper is organized as follows: Section II details our review of the emerging student models; Section III discusses the importance of different types of data that student models leverage to analyze learner behaviors. Our observations and discussions on the limitations, challenges, and future directions are presented in Section IV, followed by the conclusion in Section V.

II. EMERGING STUDENT MODELS
As stated in the introduction, student modeling is essential for an AIS to provide students with adaptive and personalized learning. The goal of student modeling is to represent and trace an individual student's knowledge state. Only with a precise understanding of a student's knowledge level and accurate prediction of their future performance can the AIS generate an optimal and individualized learning path. This section focuses on the development of modeling techniques for knowledge tracing over the past decade.

A. OVERLAY-BASED MODELING
The overlay model, invented by Carr and Goldstein [11], is a classic approach to student modeling. It usually comprises a static domain model and a dynamic student knowledge model. The former is built by a human expert, denoting expert knowledge as individual concepts and topics, and the latter is then superimposed on top of the former. As reported in two earlier survey articles [17], [20], the domain knowledge model and the learner model employ a uniform construction to represent knowledge as independent elements. A student's knowledge acquisition is simply the mastery of each knowledge component. When the student interacts with the AIS, the overlay model continuously tracks the student's knowledge state and its evolution. Afterwards, the overlay model performs diagnosis by comparing the represented student's knowledge with that of the domain/expert and recognizing the difference. Therefore, the goal of the AIS is to eradicate this difference in order for the student to reach the knowledge level of the expert.
One of the most common types of overlay modeling is the use of Bayesian networks (BN). Corbett and Anderson (2005) [12] first introduced Bayesian knowledge tracing (BKT) to model and monitor a student's knowledge state during learning. In particular, nodes in a BKT model are domain concepts or observations of students' behaviors, each of which is assigned a probability that a student has learned the knowledge or has behaved correctly. A directional edge connecting a parent node to a child node is associated with a conditional probability that reflects their conditional dependence. All probability relations of a discrete node with its parents are then kept in a conditional probability table (CPT) associated with the node. The probabilities and conditional probabilities are continuously updated through Bayesian inference as more evidence of a student mastering a node's knowledge becomes available.
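The Bayesian update described above can be illustrated with a minimal sketch of the commonly used four-parameter form of BKT (initial knowledge, learning, guess, and slip). The class name, function names, and all parameter values below are illustrative assumptions, not values taken from [12]; a deployed system would fit the parameters per skill from data.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # Illustrative values only (assumptions); real systems fit these per skill.
    p_init: float = 0.2    # probability the skill is known before any practice
    p_learn: float = 0.15  # probability of learning the skill at each opportunity
    p_guess: float = 0.25  # probability of a correct answer without mastery
    p_slip: float = 0.1    # probability of an incorrect answer despite mastery


def predict_correct(p_known: float, params: BKTParams) -> float:
    """Probability that the next response is correct, given current mastery."""
    return p_known * (1 - params.p_slip) + (1 - p_known) * params.p_guess


def update_mastery(p_known: float, correct: bool, params: BKTParams) -> float:
    """Posterior mastery after one observed response, followed by a learning step."""
    if correct:
        evidence = p_known * (1 - params.p_slip)
        posterior = evidence / (evidence + (1 - p_known) * params.p_guess)
    else:
        evidence = p_known * params.p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - params.p_guess))
    # The student may also learn the skill during this practice opportunity.
    return posterior + (1 - posterior) * params.p_learn


def trace(responses, params=None):
    """Return the mastery estimate after each response in the sequence."""
    params = params or BKTParams()
    p_known = params.p_init
    history = []
    for correct in responses:
        p_known = update_mastery(p_known, correct, params)
        history.append(p_known)
    return history
```

Calling `trace([True, False, True])` returns one mastery estimate per response. A correct answer raises the estimate and an incorrect one lowers it, with the size of each change governed by the guess and slip probabilities.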
In view of the capabilities of overlay models, such as BKT, many AISs are still built upon this old technique with variations to mitigate their limitations. For instance, prediction in the traditional BKT is performed using all observational data (no discrimination). Presumably, all users share the same initial knowledge when starting their learning sessions. This assumption might work when the data pertaining to student characteristics exhibits high uniformity. However, not all students are alike. Because of such diversity in students, it is highly possible for an AIS to initially offer study materials that are incompatible with a student's ability, and then spend time exploring data and correcting this mistake. Taking this into consideration, Nedungadi and Remya (2014) introduced an enhanced personalized clustered BKT (PC-BKT) model [14]. The method leverages a personalized model in which students are first distinguished into groups of similar knowledge levels at the beginning of the learning session. In doing so, students are not assumed to be the same initially, but are dynamically clustered by their skills into groups of high, medium, and low knowledge levels. Following this, predictions on their performances are conducted in individual groups.
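The grouping step can be sketched as follows, assuming each student has a single pre-test score between 0 and 1. One-dimensional k-means with three clusters is used here only as a stand-in; the clustering procedure actually used in [14] may differ, and the function name and score format are hypothetical.

```python
def cluster_by_knowledge(scores, iterations=20):
    """Group students into low/medium/high clusters with 1-D k-means (k = 3).

    `scores` maps a student identifier to a pre-test score (an assumed input format).
    """
    lowest, highest = min(scores.values()), max(scores.values())
    # Initialise the three centres across the observed score range.
    centers = [lowest, (lowest + highest) / 2, highest]
    for _ in range(iterations):
        groups = [[], [], []]
        for score in scores.values():
            nearest = min(range(3), key=lambda k: abs(score - centers[k]))
            groups[nearest].append(score)
        # Keep the previous centre when a cluster receives no students.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    labels = ("low", "medium", "high")
    return {
        student: labels[min(range(3), key=lambda k: abs(score - centers[k]))]
        for student, score in scores.items()
    }
```

Each resulting group could then be given its own knowledge-tracing parameters, for example a group-specific initial knowledge probability, rather than a single shared prior for all students.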
Similarly, the individualization issue was addressed in the DEPTHS ITS system [15] through stereotype clustering, where students are grouped into one of the three stereotypes: beginner, intermediate, and advanced via a self-judgment of learning abilities. Subsequently, each student is assigned a learning session compatible with his initial stereotype until he completes the first test, the result of which, along with the initial knowledge of the student, contributes to the development of an overlay model.
Grouping based solely on students' knowledge test results and/or self-judgment of abilities may not be sufficient to characterize students, resulting in a system not being as personalized and adaptive as desired. In fact, several other factors, such as the difficulty levels of tests, the time a student spent in solving the problem, and the learning material he has visited, should be considered when judging a student's performance [15]. The use of the ontology technique is in line with logical thinking, allowing for a more comprehensive student clustering.
Ontology encompasses a representation of the concepts/categories within a subject area and shows how they are related. The purpose of ontology in student modeling is to capture student characteristics from as many aspects as possible. Such multidimensional information, elaborated in detail in Section III, includes students' profiles (e.g., learning styles [21], cultural and gender differences), learning behaviors (e.g., responses to question prompts [14], [22], misconceptions [23], and emotional states [9], [24]), and learning environment (e.g., physical locations, contexts, and cultures in which students learn [16]). Consequently, incorporating an ontology technique can support a more accurate prediction of a student's future performance based on the diverse and abundant information classified and stored in an ontology.
Learning is a continuous process that involves many complications. The measures taken while a student masters a particular knowledge element often contain imprecise or incomplete information, which makes it difficult to represent knowledge mastery as simply learned or not learned. Similarly, human subjectivity exists in the process of predicting student performance. Work on augmenting an overlay model with fuzzy logic [13], [25], [26] has focused on tackling the uncertainty issue in inferring a student's performance [27]. Fuzzy logic is a rule-based reasoning technique that provides a good way to represent vague input data. Instead of using ''Yes or No'' Boolean logic, fuzzy logic transforms a ''crisp'' input into one or more fuzzy sets through membership functions. The ''degrees of truth'' are then calculated to specify the extent to which the input belongs to a given fuzzy set [28]. One such example can be seen in [13], where multi-valued fuzzy sets (Unknown, Unsatisfactory Known, Known, Learned) and associated membership functions were proposed and integrated with a traditional overlay model. In doing so, the pure overlay model evolved into a qualitative weighted model. The determination of a node weight was then performed by mapping a test score for the student's knowledge level of a domain concept into the predefined fuzzy sets with different truth degrees in the range (0, 1). In addition to fuzzy logic, the work presented in [29] followed a similar line of thinking, where a trisection partition of the learning process into unlearned, learning, and learned replaced the original bisection one.
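The mapping from a crisp test score to degrees of truth can be sketched with trapezoidal membership functions over the four fuzzy sets named above. The breakpoints and the 0-100 score scale are hypothetical choices made for illustration; the membership functions actually defined in [13] are not reproduced here.

```python
def trapezoid(x, a, b, c, d):
    """Membership that rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints on a 0-100 test score scale (assumed for illustration).
FUZZY_SETS = {
    "Unknown": (0, 0, 30, 50),
    "Unsatisfactory Known": (30, 50, 55, 70),
    "Known": (55, 70, 75, 90),
    "Learned": (75, 90, 100, 100),
}


def fuzzify(score):
    """Return the degree of truth of each fuzzy set for a crisp test score."""
    return {name: trapezoid(score, *params) for name, params in FUZZY_SETS.items()}
```

With these breakpoints, a score of 60 belongs partly to ''Unsatisfactory Known'' and partly to ''Known'', which is the kind of graded judgment that a Boolean learned/not-learned flag cannot express.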
Another shortcoming of overlay models is the inability to represent a student's mistakes or misconceptions of domain knowledge. To that end, many researchers have incorporated constraint-based models (CBMs) to compensate for the deficiency [30], [31]. The CBM is based on Ohlsson's theory of learning from errors [32], identifying learning as a process by which mistakes are first detected and corrected. Thus, constraint sets are first built with all possible correct answers. When a student's answer fails to match the condition, a constraint is violated, and a mistake is recognized. One example of this approach is the work presented in [33], where a comprehensive student model built on CBM, fuzzy logic, and overlay was proposed. During the learning process, the CBM in the student model detects student mistakes by evaluating the words present in student answers. The detected mistakes and the fuzzy logic decision system are then used to update and determine the mastery level of the knowledge component in the overlay model.
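A constraint in a CBM is commonly expressed as a pair of conditions: a relevance condition stating when the constraint applies, and a satisfaction condition stating what must hold when it does. The sketch below illustrates this idea on a fraction-addition task; the task, the constraint names, and the answer format are hypothetical and are not taken from [30]-[33].

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Dict, List


@dataclass
class Constraint:
    name: str
    relevant: Callable[[Dict], bool]   # when does the constraint apply?
    satisfied: Callable[[Dict], bool]  # what must hold if it applies?


def violated(constraints: List[Constraint], solution: Dict) -> List[str]:
    """Names of constraints that are relevant to the solution but not satisfied."""
    return [
        c.name for c in constraints
        if c.relevant(solution) and not c.satisfied(solution)
    ]


# Hypothetical constraints for a/b + c/d; answers are assumed to have a nonzero denominator.
FRACTION_CONSTRAINTS = [
    Constraint(
        name="common-denominator",
        relevant=lambda s: s["b"] != s["d"],
        satisfied=lambda s: s["den"] % s["b"] == 0 and s["den"] % s["d"] == 0,
    ),
    Constraint(
        name="correct-value",
        relevant=lambda s: True,
        satisfied=lambda s: Fraction(s["num"], s["den"])
        == Fraction(s["a"], s["b"]) + Fraction(s["c"], s["d"]),
    ),
]
```

For 1/2 + 1/3, the common mistake of adding numerators and denominators (answer 2/5) violates both constraints, while 5/6 and the unreduced 10/12 violate none. The names of the violated constraints can then feed the mastery update of the overlay model.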
While overlay models have been used broadly across a variety of academic domains, research efforts continue to be devoted to improving their representational power and prediction accuracy. For instance, dynamic Bayesian networks (DBNs) were proposed to model the hierarchy and relationships of different learning domain skills [34], [35]. Perceived as a composite model of multiple hidden Markov models (HMMs), a DBN not only represents a specific skill of a given learning domain similar to the traditional BKT, but also models the dependencies between different skills via conditional probabilities. In addition to the observed and latent variables considered in the traditional BKT, the works presented in [22], [36], [37] extended the original structure of BKT by introducing different types of nodes or student-specific parameters. To account for the influence of problem difficulty on student behaviors, Pardos and Heffernan (2011) added item difficulty nodes connected to the question nodes in the traditional BKT, conditioning the question's guess/slip upon the value of the item node [36]. Similarly, Qiu et al. (2011) introduced a new type of time node to model the time effects on students forgetting previously learned materials and/or making mistakes [37]. In view of the assumption applied to the traditional BKT that students learn, guess, or slip at constant rates, Yudelson et al. (2013) proposed a gradient-based optimization to individualize those rates [22].

B. DATA-DRIVEN APPROACHES
While overlay models substantially contribute to the efforts of knowledge tracing in adaptive instructional systems (AISs), they require human experts to build a basic domain knowledge model with a complete list of topics and concepts to be learned by students. The necessity of prior domain knowledge, potential uncertainty, and human subjectivity involved in overlay models present practical limitations. In contrast, data-driven models have become popular with the advancement of artificial intelligence and sensor informatics. In particular, the availability of increasingly large-scale datasets collected from AISs and massive open online courses (MOOCs) makes deep learning models more favorable [38]. One study in particular used deep learning to accurately predict at-risk students using the dense data available from virtual learning environments [39].
Inspired by deep learning, Piech et al. [40] developed a deep knowledge tracing (DKT) algorithm that utilizes recurrent neural networks (RNNs) to track a student's knowledge states. RNNs are well suited to processing sequential data for prediction, since an embedded looping mechanism acts as a bridge that allows information to flow from one step to the next [30]. In sequential computation, the sequence of encoded interactions is mapped to a sequence of hidden states, and then to a sequence of output vectors. In DKT, student responses serve as inputs, latent knowledge states are represented by the hidden layers, and the predicted probabilities of correctness for questions are outputs [40]. There are many variations of DKT, as reported in [41]; some change the network structure [42], [43], while others include different facets of student information [44]-[49]. In general, DKT substantially reduces the effort required of human experts, as it does not need an explicitly constructed prior domain knowledge model. In addition, DKT manages to handle large-scale datasets, permitting multiple dependent variables as input vectors as long as they can be vectorized [50]. In addition to DKT, factorization machines are another popular method for knowledge tracing, many of which [51]-[53] have been shown to match the performance of DKT.
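The input-hidden-output mapping described above can be sketched as the forward pass of a small recurrent network. This is a minimal, untrained illustration written with the standard library only: the class name, the hidden size, and the random weights are assumptions, biases are omitted, and a practical DKT model would be trained on interaction logs, typically with a deep learning framework and often with LSTM cells.

```python
import math
import random


def one_hot_interaction(skill, correct, n_skills):
    """Encode a (skill, correctness) pair as a one-hot vector of length 2 * n_skills."""
    x = [0.0] * (2 * n_skills)
    x[skill + (n_skills if correct else 0)] = 1.0
    return x


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyDKT:
    """Untrained simple RNN; the weights are random placeholders, not a fitted model."""

    def __init__(self, n_skills, hidden=8, seed=0):
        rng = random.Random(seed)
        self.n_skills = n_skills
        self.hidden = hidden
        self.w_xh = rand_matrix(hidden, 2 * n_skills, rng)  # input -> hidden
        self.w_hh = rand_matrix(hidden, hidden, rng)        # hidden -> hidden (the loop)
        self.w_hy = rand_matrix(n_skills, hidden, rng)      # hidden -> output

    def forward(self, interactions):
        """Return, after each interaction, P(correct) for every skill."""
        h = [0.0] * self.hidden
        outputs = []
        for skill, correct in interactions:
            x = one_hot_interaction(skill, correct, self.n_skills)
            pre = [a + b for a, b in zip(matvec(self.w_xh, x), matvec(self.w_hh, h))]
            h = [math.tanh(p) for p in pre]  # latent knowledge state
            logits = matvec(self.w_hy, h)
            outputs.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
        return outputs
```

Because the hidden state `h` is carried from one interaction to the next, each output vector depends on the whole response history rather than on the latest answer alone.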
As mentioned earlier, the manual construction of an expert model is strikingly challenging, time-consuming, and error prone. The adoption of machine learning (ML) techniques can automate this process in a more efficient and effective manner. One such example of ML-based AIS authoring tools is SimStudent [54], a system that allows a domain expert to create an expert model by tutoring an AI-simulated student. During interactions, the expert presents question prompts to SimStudent and offers feedback on SimStudent's responses to questions or requests for hints. Throughout all interactions, SimStudent induces underlying rules to create an expert knowledge model. Similar works can be found in [55], where a data mining technique was used to automatically induce a partial knowledge model based on students' responses.
In addition to discovering underlying knowledge models for learning materials, ML has been broadly used to anticipate student intentions, actions, and performance. For instance, a two-stage supervised learning method was proposed in [56] to detect when and what a student just learned (i.e., the content learned at a specific time step). The work in [57] used least squares and ridge regression to automatically detect off-task behaviors in ITS. Other efforts have been made in discovering student misconceptions [58], [59], student behavior [60], learning styles [61], [62], and emotions [63].

C. DISCUSSION
Modeling and predicting student behavior during learning is a complicated process. Such complexity and associated challenges have generated tremendous interest in student modeling, resulting in many approaches varying in terms of methods and, more importantly, limitations. Overlay-based modeling leverages the benefits of expert knowledge to enable adaptation. However, it is rarely applied to deal with cold-start problems owing to the lack of sufficient prior information. While each extension tackled a distinct deficiency of the traditional overlay model, their common underlying idea was to incorporate more variables in inferring student behavioral, cognitive, and/or affective states. For instance, student behaviors can be derived from examining features associated with student interactions with an AIS during learning, as well as student static traits such as learning styles, preferences, and personality traits. Similarly, a cognitive state can be influenced by multiple factors, including motivation, emotion, and aptitude. Although additional elements in student modeling can potentially elevate an AIS's likelihood of providing accurate feedback to a student, the complexity of student modeling increases significantly with an increase in data dimensions. Therefore, overlay-based models function best when they can be implemented alongside extensive feature extraction methods or multimodal student data. Furthermore, researchers must give careful consideration to the configuration of their overlay-based model to ensure desirable predictions. This consideration can make overlay-based models very time- or cost-intensive to implement.
On the other hand, the emergence of data-driven models addresses several deficiencies of overlay-based models. First, data-driven approaches derive a computational model from observed data only, which avoids the uncertainty and subjectivity existing in the human construction of an expert model. Second, data-driven models are better equipped to automatically deal with high-dimensional data. Finally, machine-learning-based methods only require a large amount of data to automatically learn inherent patterns and relations and to discover hidden variables, tasks that overlay-based models handle with difficulty [50]. However, such data demands in model training inevitably give rise to the issues of data collection, quality control, and overfitting/underfitting. The lack of theories around model creation for data-driven approaches introduces mistrust in their outputs and applications [20]. Thus, the integration of machine learning with model-based methods presents a promising research direction [64], [65].

III. MMLA BASIS
As clearly demonstrated in Section II, data are the key to student modeling, regardless of overlay-based or data-driven methods. While learners interact with AISs, many digital traces are left behind to be recorded and used to analyze their learning performance. As seen in [66], [67], the most common data source used in AISs is students' responses to question prompts. However, there are many facets of learner interactions with AISs, resulting in additional measurements that can and should be used to infer learners' performance. In fact, multi-modality data concerning sensorimotor experiences in learning are also valuable for representing learner attributes. As reported in [68], the use of multidimensional data improves the accuracy of predicting learning performance, resulting in AISs being more personalized and adaptive to individual learners. To this end, this section overviews the key foundation of multimodal learning analytics (MMLA), with a focus on various data inputs, collection contexts, and their relations to different learning indicators.

A. MULTIMODAL DATA INPUTS
MMLA is a process of analysis, apprehension, and optimization of learners' learning environment, processes, and outcomes through multimodal data collected from learners, their learning contexts, and their learning processes [69]. Some of the data are domain-independent whereas other data are domain-specific. Some data are static while others are dynamic. With the consideration of various affordable devices in the market to measure human learning behaviors and potential learning activities to be logged within AISs, this paper categorizes data inputs to MMLA into four types: learning data, physiological data, psychometric data, and environmental data. A combination of some or all of the aforementioned data sources is then used to interpret what learners know and/or will do. The typical learning indicators that have been widely utilized in literature are learning performance [68], [70], learners' attention levels [67], emotion [71], [73], collaboration and interaction degrees [74], [75], cognitive states [76], and engagement levels [77], [78].

1) LEARNING DATA
This set of data is usually related to learning content, processes, and how learners interact with them. General information on learning materials, such as questionnaires, difficulty levels, and tutorials, is set before learning processes take place. Likewise, learner profiles regarding how learners prefer to deal with learning, such as learning styles, culturally dependent variables, and demographics, are also static and remain unchanged throughout the learning sessions. On the other hand, the learning data generated from student interactions with learning platforms, such as student test scores, student learning actions, and their timestamps in the learning process, is dynamic and constantly updated during the learning session. These characteristics, such as cognitive states and engagement levels [77], are directly related to student learning and are used to predict learning performance. Due to its nature, learning data is often the most accessible, obtainable through simple means such as questionnaires, quizzes, or actions taken by the student in the instructional system.

2) PHYSIOLOGICAL DATA
Learning often elicits emotional responses and bodily alterations in learners which, in turn, may effect changes in a learner's physiology. Several theories describe a link between emotion, cognition, and complex learning. Thus, monitoring such signals may assist in the synthesis of student models. As such, various approaches to MMLA incorporate access to these modalities of learner inputs. Such physiological signals are usually obtained through electroencephalography (EEG) [68], wristband [68], skin pads [79], pupillometry/eye tracking devices [70], and functional made use of a pressure sensitive computer mouse to measure cognitive load [80]. This type of data has been largely used to measure learners' affective states [73], attention [67], frustration [82], or boredom [83]. Some studies have also combined these physiological signals with learning data to analyze learner cognitive states [84] and learning performance [68], [70]. Physiological data as a whole offers a much more powerful modality for tracking engagement, focus, emotion, or other useful physiological indicators. However, the majority of physiological data measures use some external hardware for data collection, which can make implementation more costly or inaccessible.

3) PSYCHOMETRIC DATA
This type of data is often collected through self-report questionnaires [68] and interviews [85] that measure psychometric factors of learners' ability, personality, and motivation and their relationships with academic performance. For instance, an experiment was carried out in [86], where students' brain activities and physiological signals, together with their perceived level of uncertainty, were recorded to predict their internal state of learning. The study concluded that students' uncertainty is associated with their mental and emotional reactions. Psychometric data also may include interviews with learners which can be used to detect the motives and behavior patterns of students as discussed in [85]. Like physiological data, psychometric data offers insight into a student's mental state at the time of learning. Unlike physiological data, the majority of methods, such as surveys and interviews, are more accessible for an intelligent tutoring system compared to specialized hardware.

4) ENVIRONMENTAL DATA
Research has found that the surrounding context of learning can have social and educational impacts on students. Realyvasquez-Vargas et al. (2020) recently studied the impact of learning context on student performance in an online class during the COVID-19 pandemic, and found that lighting, noise, and temperature have significant direct effects on student academic performance [87]. Environmental factors are rarely considered as a usable modality due to the difficulty in controlling them in a real classroom setting. As such, there has been little exploration into the use and impact of environmental differences.

B. DATA COLLECTION SETTINGS
Data collection in MMLA studies is usually performed in two different types of environments: lab settings and real-classroom settings.
In a lab setting, data collection occurs in a carefully designed environment where learning tasks and procedures are predefined without exception. This makes it much easier to collect high-quality, low-noise data. As a result, better performance in interpreting learners' behaviors can be achieved. For example, one study collected the click data of 17 participants during their game playing, together with the physiological data from sensors, such as EEG, eye tracker, camera, and wristband [68]. In predicting a player's ability to master complex tasks, LASSO regression was then utilized for feature selection, achieving an error rate of 18% [68]. Similar works can be found in [72], [86] with prediction rates of 72% and 65.08%-83.25% (varying with different models), respectively. However, the final dataset in a lab setting is typically relatively small, raising scalability and generalizability issues. One potential solution is the use of transfer learning, as exemplified in [88], where the model was first built on existing large datasets and then tested using the collected data. This approach assumes that the training sets are similar to the collected sets. On the other hand, since data collection is fully controlled in the lab setting, there is not much constraint in terms of sensor setup and deployment as long as the lab can afford it [89].
In a real classroom setting, a much larger number of students are usually engaged in the learning process, where students learn at their own pace in a non-judgmental manner. On the other hand, the control of learners in terms of learning environment, tasks, and procedures is often relaxed in comparison to lab settings. Consequently, there is a higher chance that some data modalities may be interrupted during learning for various reasons, such as limited sensor battery capacity or learners moving out of a sensor's range, making it challenging to synchronize data from different channels [72], [90]. Manual alignment is often needed in such cases, inevitably introducing some synchronization errors [91]. Cost, setup complexity, and intrusiveness are the major factors in the choice of sensors for large-scale pilots in real classroom settings.
While major efforts of MMLA remain in lab settings, it is expected that MMLA studies can radically transform our education. Therefore, further efforts to implement MMLA in real classroom settings are a significant and necessary research area. With the technological advancement of AI, data mining, and sensor informatics, more work is foreseen to tackle the aforementioned challenges in real classroom settings.
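One simple way to approach the synchronization problem is to align each secondary stream to a reference stream by nearest timestamp and to mark a sample as missing when no reading falls within a tolerance. The sketch below illustrates that idea only; the function name, the data layout, and the tolerance are assumptions, and it is not the alignment procedure used in [72], [90], or [91].

```python
from bisect import bisect_left


def align_nearest(reference, other, tolerance):
    """Align a sensor stream to reference timestamps by nearest neighbour.

    reference: sorted list of timestamps in seconds (e.g., logged learning events)
    other:     sorted list of (timestamp, value) pairs from another channel
    tolerance: maximum allowed time gap in seconds (an assumed tuning parameter)

    Returns a list of (reference_timestamp, value or None); None marks a gap,
    for example when a sensor dropped out or the learner left its range.
    """
    times = [t for t, _ in other]
    aligned = []
    for ts in reference:
        i = bisect_left(times, ts)
        # Only the samples just before and just after ts can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - ts), default=None)
        if best is not None and abs(times[best] - ts) <= tolerance:
            aligned.append((ts, other[best][1]))
        else:
            aligned.append((ts, None))
    return aligned
```

Keeping the gaps explicit as `None`, rather than silently filling them with the closest available reading, lets a later processing step decide whether to interpolate, drop the sample, or fall back to the remaining modalities.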

IV. OBSERVATIONS AND PERSPECTIVES
This section summarizes the state-of-the-art developments in student modeling and analysis in the past decade and offers perspectives and future research directions in these areas. For this reason, our observations of current research tendencies are first presented in Section IV-A, followed by an industry case in Section IV-B that supports our findings and future directions.

A. FINDINGS
Over the past decade, various models have been developed to accomplish the student modeling task, with the shared goal of maximizing prediction accuracy and optimizing learner adaptivity. They encompass, but are not limited to, overlay-based modeling and data-driven approaches for knowledge tracing. Based on the classic overlay model, enhanced models address limitations of the classic model by integrating other specific modeling techniques. For instance, fuzzy logic handles uncertainty and subjectivity in student prediction, CBM records student mistakes, stereotype clusters students based on similar traits, and ontology stores multiple dimensions of students' attributes. With the development of AI, sensor informatics, and data mining, state-of-the-art data-driven approaches, such as DKT and its variants, significantly outperform overlay-based models in knowledge tracing.
As elaborated in Section II, various student modeling methods continue to appear, and several deficiencies exist in their evolution. Consequently, the selection of appropriate student models needs to be investigated further. Based on our examination, we recommend considering three parameters.
First and foremost is the total volume of available data. As shown in [38], DKT is only valuable when large datasets are available. Choosing overlay-based models over deep-learning-based models is reasonable when sufficient data cannot be gathered. One key thing to note is that even a large-scale dataset can be of insufficient quality, which would likely diminish the interpretability of the models. Second is the necessity and practicality of the prior domain knowledge model. Building such a model requires the effort of human experts, who are expected to avoid errors and to keep their inputs as objective as possible. Additionally, when the knowledge domain is inherently convoluted, containing complex concept patterns or latent concepts in learning materials, it is challenging to build a model that embodies all interrelationships within the domain. The last is the dimension of the data input to a student model. As mentioned earlier, an increase in the types of data increases both model complexity and data collection difficulty. In summary, the decisions regarding what data about a student to gather, how to collect it, and how it is related to student learning are essential in designing a student model. In view of the advantages and disadvantages of overlay-based and data-driven methods, the hybrid effort of integrating both is envisioned as a future research direction [64], [65].
There are several student characteristics and learning features that can be used in student modeling and MMLA. Their selection is highly dependent on the intended functions of AISs, where student models and MMLA studies are used. In addition, special attention should be paid to non-intrusive sensors that do not interfere with learning while collecting data. Research has found that when people know they are being monitored and studied, they may act differently than they normally would. While the need, importance, and potential benefits of multi-modality data can hardly be overstated, we have to recognize that the challenges grow with the increase in data dimensions. Thus, methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and many others should be explored for feature selection [92] and dimensionality reduction [93].
Research in a laboratory setting is often conducted in a confined environment with quality control of data collection, resulting in much more accurate data. This kind of research is still valuable for validating hypotheses and reinforcing our theoretical foundation, with the hope of bringing some of them into practice and reshaping our education. In view of transformative research in education, more MMLA studies in real-classroom settings are needed, regardless of the number of challenges we face. Technologies such as autosync systems [91] and advanced filtering techniques [94] should be considered to ensure the utility and usability of the data.

B. AN INDUSTRY CASE
This section presents empirical evidence of an MMLA study in a real classroom setting to support our observations. An AIS called Squirrel AI Learning (SAIL) developed by Squirrel AI was deployed at two after-school centers, while multimodal data were collected and analyzed. In our empirical case study, we focused on exploring the effectiveness of an AIS in a real classroom setting. As such, experimental conditions were deliberately not tightly controlled or set to specific parameters, so that our data accurately reflected a real classroom environment. However, all students interacted with the same system in the same manner described below:

1) SAIL
SAIL is an adaptive instructional system that offers after-school tutoring services in six subjects, including Chinese, math, English reading, English grammar, chemistry, and physics. Each learning session in SAIL takes approximately 2 hours with a focus on a particular knowledge component of a given subject domain. Such a knowledge component can be a unit of facts, concepts, or problem-solving skills.
As shown in Fig. 1, a student entering each SAIL session will go through a three-stage learning process. The student will first take a pretest to identify knowns and unknowns present in his knowledge. He then enters into learning mode in which the system adaptively assigns exercises (as shown in Fig. 2) and tutorial videos (as shown in Fig. 3). As the system continuously measures his performance, the difficulty level of the corresponding exercise is automatically updated. The process then repeats, keeping the student in learning mode until the measured performance exceeds a predefined proficiency threshold. In other words, SAIL continuously adapts to student performance and continues to provide appropriate content until the student has mastered the relevant material.
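The learning-mode loop described above can be illustrated with a minimal sketch. It is not SAIL's actual algorithm: the function name, the sliding window used to measure performance, the five difficulty levels, and the one-step update rule are all assumptions made for illustration only.

```python
def run_learning_mode(answer_fn, threshold=0.8, window=5,
                      min_level=1, max_level=5, max_items=50):
    """Serve exercises until recent accuracy exceeds the proficiency threshold.

    answer_fn(level) -> bool stands in for the student answering one exercise.
    Returns the list of (level, correct) pairs that were served.
    """
    level = min_level
    history = []
    for _ in range(max_items):
        correct = answer_fn(level)
        history.append((level, correct))
        # Measured performance: share of correct answers in the last `window` items.
        recent = [was_correct for _lvl, was_correct in history[-window:]]
        if len(recent) == window and sum(recent) / window > threshold:
            break  # proficiency threshold exceeded, leave learning mode
        # Hypothetical update rule: harder after a success, easier after a miss.
        level = min(max_level, level + 1) if correct else max(min_level, level - 1)
    return history


# A simulated student who answers every exercise correctly.
always_right = run_learning_mode(lambda level: True)

# A simulated student who misses the first two exercises.
answers = iter([False, False, True, True, True, True, True])
slow_start = run_learning_mode(lambda level: next(answers))
```

In this sketch the second student needs seven exercises, because the two early misses stay inside the five-item window until they are pushed out by later correct answers.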

2) EMPIRICAL SETTING
This study involved 155 students from School N and 52 students from School G and covered the 8th grade (15-16 years old) math subject only. Each participant wore a "brainwave" headset (manufactured by BrainCo 1 ) while working on a personal computer with a webcam turned on for a SAIL session. As shown in Fig. 4, three data sources were collected for each student's learning: (i) user records from SAIL regarding student responses to questions; (ii) brainwave data from the student headset stored by the FocusEDU platform; and (iii) student facial data captured by the webcam and stored by Debut video capture software. The full dataset is available at this link. 2

a: USER RECORDS
User records are compilations of information recorded throughout the study. An entry is recorded whenever a student answers a practice question. Each entry in the user records contains a student's unique ID, as well as the subject and module they are working on at that moment. Entries also contain the question ID, correct answer, chosen answer, a marker indicating if the answer was correct, a timestamp for the start of the question, and the time taken to answer.
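As an illustration, one entry of the user records could be represented as follows. The field names, types, and time units are assumptions made for this sketch; the text lists the kinds of information stored but not the actual schema of the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRecord:
    """One answered practice question (hypothetical field names)."""
    student_id: str
    subject: str
    module: str
    question_id: str
    correct_answer: str
    chosen_answer: str
    is_correct: bool
    start_time: float  # question start, in seconds (assumed unit)
    time_taken: float  # time spent answering, in seconds (assumed unit)

    @property
    def end_time(self) -> float:
        """Time at which the answer was submitted."""
        return self.start_time + self.time_taken


record = UserRecord("s001", "math", "m01", "q42", "B", "B", True, 1000.0, 35.5)
```

The derived `end_time` gives the interval of each question, which is what a timestamp-based alignment with the other modalities would need.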

b: BRAINWAVE DATA
Brainwave data is measured from the BrainCo brainwave headset. Data are output in 3 files: attention, EEG, and events. The attention file logs a student's attention value, which is derived directly by the BrainCo headset and output as a timestamp followed by a value in the range (0-100). The EEG file contains raw EEG data in array form, showing the raw electrical signal measurements from each of 160 points at a given time. The events file contains raw events data recording if, at a given time, the device is connected or not.
1 https://brainco.tech
2 https://github.com/RyanH98/SAILData

c: WEBCAM DATA
Video data of learning sessions was also captured via a webcam attached to the computer used for each tutoring session. This video data captures each student's upper body and face, and each video is timestamped for easy synchronization. Video data were processed to obtain low-level features of face tracking and facial expression extraction. In the final data set, this data is available as a set of facial landmarks from each video. There are 51 landmarks per face, including 10 eyebrow landmarks, 12 eye landmarks, 9 nose landmarks, and 20 mouth landmarks, as shown in Fig. 5.

d: DATA SYNCHRONIZATION AND PROCESSING
Owing to the uncontrollability of the real classroom setting, a series of synchronization procedures were implemented to ensure the utility of the collected data. First, of the 207 participants, only 149 facial videos (126 and 23 from Schools N and G, respectively) could be matched to a valid user ID. Second, each valid video stream, like each EEG stream, was segmented by question based on timestamps. In doing so, the facial reactions and EEG data of a student were correlated, as shown in Fig. 6, with the student's responses to a question prompt in each knowledge component with the specified difficulty level. Finally, all filtered video clips were validated by checking the percentage of frames with a detected face, and clips that failed a 30% threshold on face detection were removed from the dataset. After this systematic cleanup, a final dataset was compiled from the three modalities, resulting in 720 question segments from 50 students with both facial landmarks and the corresponding EEG-based attention series. Note that the usable dataset is much smaller than the original data volume.
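The segmentation and validation steps can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the function names and data layouts are hypothetical, and the filter assumes that a clip is kept when at least 30% of its frames contain a detected face.

```python
from bisect import bisect_left


def segment_by_question(samples, questions):
    """Split a time-sorted stream into per-question segments.

    samples:   list of (timestamp, value) pairs sorted by timestamp
    questions: list of (question_id, start, end) with start <= end
    Returns {question_id: [values with start <= timestamp < end]}.
    """
    times = [t for t, _ in samples]
    segments = {}
    for qid, start, end in questions:
        lo = bisect_left(times, start)
        hi = bisect_left(times, end)
        segments[qid] = [v for _, v in samples[lo:hi]]
    return segments


def face_ratio(face_flags):
    """Fraction of frames in a clip with a detected face (0.0 if empty)."""
    return sum(face_flags) / len(face_flags) if face_flags else 0.0


def keep_valid_clips(clips, min_face_ratio=0.3):
    """Keep clips whose detected-face ratio reaches the assumed threshold."""
    return {qid: flags for qid, flags in clips.items()
            if face_ratio(flags) >= min_face_ratio}


# Toy attention stream and question intervals (made-up values).
attention = [(0.0, 50), (1.0, 55), (2.0, 60), (3.0, 40), (4.0, 45)]
questions = [("q1", 0.0, 2.0), ("q2", 2.0, 5.0)]
segments = segment_by_question(attention, questions)

# Per-frame flags: True where a face was detected.
frames = {"q1": [True, True, False, True],
          "q2": [False, False, False, True, False]}
valid = keep_valid_clips(frames)
```

Because every stream is cut with the same question intervals, segments from different modalities that share a question ID can be joined directly.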
Features retrieved from the user records, such as difficulty levels of questions, student answers to the questions, the time taken to answer the questions, and student actions in visiting answer sheets and requesting hints, were first used to classify the students into two groups by K-means (K = 2) clustering. For the facial data, special attention was paid to the movement of a student's head, eye, and mouth. Our study hypothesized that the movement of a student's head towards the computer screen is correlated with his increased attention level, while the movement of his head away from the screen is an indicator of relaxation or boredom.
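A minimal sketch of the K-means (K = 2) grouping step is given below. The per-student features, the min-max scaling, and the deterministic seeding with the first K points are assumptions made for illustration; the study does not specify its feature encoding or initialization.

```python
import math


def min_max_scale(rows):
    """Rescale every feature column to [0, 1] so no feature dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)]
            for row in rows]


def kmeans(points, k=2, iterations=100):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical per-student features: [accuracy, mean seconds per question, hints].
students = [
    [0.90, 30.0, 0],
    [0.40, 75.0, 4],
    [0.85, 35.0, 1],
    [0.35, 80.0, 5],
]
labels, centroids = kmeans(min_max_scale(students), k=2)
```

The resulting group label per student can then be used as one additional input feature alongside the other modalities.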
In order to derive the eye features, we used the eye aspect ratio (EAR) introduced by Soukupova and Cech (2016) [97], which has been used for blink detection. EAR is calculated by Equation 1 from the 2-dimensional coordinates of the 6 landmark points of each eye region, as shown in Fig. 7, where points p2, p6, p3, and p5 measure the distance between the upper eyelid and the lower eyelid, and p1 and p4 are the two corners of the eye:
$$\mathrm{EAR} = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2\,\lVert p_1 - p_4 \rVert} \tag{1}$$
A low eye blink rate is also expected when engaging in computer activities [95].
Similarly, mouth movement is an indication of positive emotions, such as joy, which positively correlates with students' self-efficacy and overall achievement [96]. To that end, the first 150 frames of each video clip were used as the baseline. The changes in nose length, in eye aspect ratio (EAR) [97], and in the distances describing lip width, lip center, and left/right lip corners were then calculated for the head, eye blink, and mouth features, respectively.
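For illustration, EAR can be computed from the six eye landmarks as the sum of the two vertical eyelid distances divided by twice the horizontal distance between the eye corners. The blink counter below is a simple thresholding sketch; the threshold of 0.2 and the minimum run of two frames are illustrative assumptions, not values reported in the study.

```python
import math


def eye_aspect_ratio(eye):
    """EAR from six (x, y) eye landmarks ordered p1..p6.

    p1 and p4 are the eye corners; (p2, p6) and (p3, p5) are the
    upper/lower eyelid pairs.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def count_blinks(ear_series, threshold=0.2, min_frames=2):
    """Count runs of at least min_frames consecutive frames with EAR below threshold."""
    blinks, run = 0, 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:  # a blink still in progress at the end of the clip
        blinks += 1
    return blinks


# Toy landmarks of an open eye (made-up coordinates).
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
ear = eye_aspect_ratio(open_eye)

# Toy EAR series with two dips long enough to count as blinks.
blinks = count_blinks([0.3, 0.3, 0.1, 0.1, 0.3, 0.1, 0.3, 0.15, 0.12, 0.1])
```

Because EAR is a ratio of distances, it is largely insensitive to the scale of the face in the frame, which is convenient when students sit at different distances from the webcam.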
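The baseline step can be sketched as below. The text states that the first 150 frames serve as the baseline; summarizing them by their mean and expressing later frames as differences from that mean are assumptions of this sketch, as is the function name.

```python
def changes_from_baseline(series, baseline_frames=150):
    """Subtract the mean of the first baseline_frames values from later values."""
    if len(series) <= baseline_frames:
        raise ValueError("series must be longer than the baseline window")
    baseline = sum(series[:baseline_frames]) / baseline_frames
    return [value - baseline for value in series[baseline_frames:]]


# Toy nose-length series with a 3-frame baseline for readability.
nose_length = [10, 12, 11, 13, 9]
deltas = changes_from_baseline(nose_length, baseline_frames=3)
```

The same function applies unchanged to a per-frame EAR series or to any of the lip distances.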

e: RESULTS
Understanding student learning performance is critical to steering adaptation in SAIL. Here, student performance is measured as the correctness of student answers. All features extracted from multimodal data can then be used to predict such performance. To demonstrate the advantages of MMLA, our study fitted a binary random forest model with 5-fold random split cross-validation to predict students' performance using different combinations of feature sets. The area under the curve (AUC) was then adopted as the measure of prediction accuracy.
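The evaluation protocol can be illustrated with a small, dependency-free sketch of the two generic ingredients: a shuffled 5-fold split and the AUC computed as the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one. The random forest itself is omitted here; any classifier that outputs a score per sample could be plugged in. The function names and the toy scores are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Toy example: 1 = correct answer, 0 = incorrect answer, with model scores.
y_true = [1, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc(y_true, y_score)

folds = list(k_fold_indices(10, k=5))
```

In a full evaluation, the model would be fitted on each training split, scored on the matching test split, and the per-fold AUC values averaged.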
As shown in Table 1, the addition of other modalities led to a noticeable increase in prediction accuracy. When only students' facial landmarks were used, the AUC was 0.593. If both the student clustering results and the facial landmarks were used to train the model, the prediction performance was slightly higher (0.63). In the last step, three modalities of data were used to fit our model, and the prediction performance was further improved by 0.06 to 0.69. The increased prediction performance can be directly attributed to the addition of more modalities in the training data set. The effect is also expected to be more pronounced with the addition of further modalities.
The use of additional modalities provided an overall increase in prediction accuracy from 59.3% with only facial landmarks to 69.0% with group ID and attention added. We did not measure a massive difference in classification accuracy when adding each modality. This could be due to the modalities chosen; for example, facial landmark data and attention data could be correlated as we hypothesized, which would lessen the impact of attention data when facial landmark data is also used. Even with this marginal increase, however, our study serves as a proof of concept to demonstrate the impact that additional data modalities can have on student classification accuracy. With the addition of more modalities or different modalities, it is expected that prediction accuracy would increase further.
Overall prediction performance of 69.0% is not indicative of state-of-the-art accuracy, and prediction performance was likely affected by the real-world data collection environment used in the industry case study. Since student participation was not tightly controlled, data recording was prone to noise, significantly so in the webcam and EEG data. Lighting conditions, student behavior, student appearance, and EEG headset alignment could all lead to artifacts in the data. Furthermore, the real-world environment led to syncing issues between different modalities since different collection measures were started and stopped at different intervals. These syncing issues had to be resolved manually. Through the combination of syncing issues and noise, the overall volume of usable data obtained from the study was small relative to the total volume of data recorded. However, enough usable data was obtained to draw significant conclusions from the case study.
The case study results demonstrated here show the effect of additional modalities in an adaptive instructional system. More significantly, however, these results serve to demonstrate the challenges inherent in real-world AIS implementation, as discussed in Section III. Due to syncing issues, lighting changes, and other uncontrolled variables such as student movement, our data volume and quality were both heavily affected. This type of uncontrolled collection environment is reflective of what a researcher might encounter when developing and testing real-world AISs. Thus, to enable more accurate AISs for use in real classrooms, additional research is needed to cover in-classroom testing, automatic data syncing methods, additional data modalities, and methods like noise reduction [98] to help improve prediction accuracy. Additionally, new and evolving data processing and machine learning methods will also serve to solve the issues presented here, improving AIS performance throughout the field.

V. CONCLUSION
Adaptive instructional systems have been heavily explored in educational settings to provide instructional support that is closely aligned with individual student needs.
To successfully provide instructional support, AISs must gain a precise understanding of students with the technological assistance of student modeling. By dynamically representing a learner's characteristics, the student model can address several cognitive learning issues, such as identifying prior knowledge, isolating the underlying misconceptions, analyzing learning performance, and recognizing learning plans. Considering the importance of and many challenges associated with the construction of an effective student model, this paper presents the state-of-the-art development of student modeling in the past decade.
Many different models have been designed to represent and trace student knowledge, each of which has its own benefits and limitations. Our overview of the evidence pertaining to these developments reveals two major research trends in student modeling. First, stemming from the early efforts of using single-faceted data sources (i.e., student responses to question prompts) to construct models, more studies have been conducted to consider student cognitive and affective patterns. Second, hybrid models, which take advantage of both expert knowledge and data power, have been created to strengthen overlay-based and machine-learning-based models. The increase in data dimensions not only enhances the forecasting capabilities of student models, but also brings many challenges to MMLA. This issue is further magnified when data are collected in real classroom settings. Many more research efforts should be devoted to feature selection, feature extraction, and the reduction of dimensionality and complexity. While the studies of AISs in real classroom settings, as exemplified in our industry case, are limited, the continued improvement, readiness, and flexibility of AISs will make them a competitive alternative to human teachers in offering personalized learning.
JING LIANG is one of the 2019 Top 30 Female Tech Entrepreneurs in China and 2020 Female Business Leaders of the Year. She has built several FMCG and young fashion brands in succession. She has created the brand "Squirrel Ai" and the original "360D" branding theory to expand the international influence and reputation of the brand. As a company spokesperson, she has been interviewed by more than a dozen first-tier media, such as CCTV, Hunan TV, Zhejiang TV, and Jiangsu TV, and global media, such as Fortune, Forbes, Financial Times, and MIT TR. She has been invited to speak at IJCAI, ACM KDD, Slush, TechCrunch, UBS, and other international summits, and interviewed by Bloomberg and other media.

SHIMENG PENG is currently pursuing the Ph.D. degree with the Department of Intelligent Systems, Graduate School of Informatics, Nagoya University. Her research interests include analyzing multimodal wearable sensory signals, such as physiological data and facial expressions during student learning activities, and applying machine learning methods to predict students' affective states, such as nervousness, engagement, and interest, and their impact on learning performance in order to guide improvements of their learning outcomes.

RYAN HARE
MINGYU LEI is a Senior Engineer with the China Academy of Information and Communications Technology (CAICT), where he is the Director of the Business Development Department, Institute of Technology and Standards. He is a member of the Expert Committee of the Ministry of Industry and Information Technology (MIIT). He has undertaken tasks assigned by the State in drafting state standards and organizing the formulation of standards in their respective trades. He is a Consultant of the Greater Bay Area Digital Economy Association. He is the Secretary General of the Intelligence Education Working Committee, where he has built an influential industrial innovation platform in the field of intelligence education. He has organized several forums and conferences in the field of smart education and prepared the Intelligence Education Industry Report.