Multisensory Interaction and Analytics to Enhance Smart Learning Environments: A Systematic Literature Review

Smart learning environments (SLEs) leverage technological developments to enable effective, engaging, and personalized learning. SLEs rely on sensing and advanced interconnectivity capabilities to infer and reason about the learning context. A natural set of affordances for SLEs are those of multisensory environments (MSEs). MSEs enable humans to make full use of their senses and equip learning systems with new information, methods of information exchange, and intelligent capabilities. In addition to novel forms of interaction, MSEs also offer novel forms of “learner traces” through multimodal learning analytics (MMLA). This article presents the results of a systematic literature review on how multisensory interactions and the respective analytics can support the use and design of SLEs. The findings from the analysis of 33 papers synthesize and clarify the latest advancements in the intersection of interactions and analytics in MSEs, discuss how those advancements can support SLEs' affordances and uses, and pave the way for improving our understanding of how various interaction modalities can support learning and under what conditions.


I. INTRODUCTION
Recent technological developments, such as those providing various levels of adaptation for diversified learning conditions (e.g., curriculum, learning materials, teaching and assessment strategies, and support), indicate a clear need to fully explore how smart learning environments (SLEs) can be designed and employed to benefit students and teachers [1], [2]. As Spector [3] stated, when novel qualities and attributes, such as adaptability, flexibility, thoughtfulness, and engagingness, are related to a learning environment, that environment can be deemed smart. For such qualities and attributes to materialize, SLEs must sense the learning context and process it by collecting various forms of data; those data are then analyzed to enable the SLEs' smart capabilities [1]. Traditional SLEs are based on "mainstream" data, such as learners' and educators' computer logs, that can then support richer representations (e.g., correctness and response times). In this article, we focus on how multisensory environments' (MSEs) interaction and analytics capabilities can be combined to support SLEs, and we examine the potential of this combination to enable data-driven interventions.
A variety of systems that use multisensory experiences beyond vision and audition have already been investigated and developed to facilitate relaxation, minimize agitation and anxiety [4], enable communication [5], and support learning [6]. The identification of new ways to enhance and facilitate learning is a central goal in both human-computer interaction (HCI) and learning technology research [7], with the synergy between learning analytics (LA) and embodied learning offering a promising approach [8]. Many studies have shown that allowing learners to engage in a variety of ways increases implicit awareness and enhances learning. Research has investigated numerous technologies using different approaches to determine their effect on learning in various disciplines, such as science [9], mathematics [10], music [11], and language acquisition [12].
The use of vision and audition for interaction has dominated the field of HCI for decades, with graphical and auditory user interfaces playing the leading roles, despite nature's provision of many more senses for perceiving and interacting with the world around us. More recently, touch-less games have used motion sensors and devices such as depth cameras to detect body movements and gestures [13], [14], [15]. This limited use of interaction modalities is particularly relevant for learning with technology, which is a complex process associated with many aspects of interaction (e.g., difficult mental operations and cognitive friction) and has the potential to benefit the most from the plurality and complementary use of different interaction modalities [8], [16]. The identification and utilization of important measurements associated with the learning processes, outcomes, and activities (with the use of LA) play a vital role in understanding, facilitating, and empowering human learning. Advancements in sensing technologies allow such systems to "sense" and "respond" to users' presence, as well as their gestures, affective states, motions, and manipulations, orchestrating the multisensory stimuli in a number of ways. Understanding the effectiveness of the learning experience is crucial, and these technologies permit the collection of rich data from multimodal inputs reliably and consistently. Such data collection allows us to go beyond the limits of human observation [17] and support learning in situ [18].
The design and use of MSEs to support learning come with several challenges. For instance, we need to determine which multisensory experiences are meaningful in facilitating learning; how to design educational interfaces that take into account the relationships between the various senses (e.g., integrating touch and audition with vision, and even taste and smell); and what limitations come into play when we monitor how users interact during a learning activity. Contemporary research on MSEs that leverages nonconventional interfaces to support learning (e.g., [19], [20]) exemplifies how learners' senses can enable novel interaction and analytics. Examples of such systems include audio interfaces, conversational interfaces, and motion- and gesture-based systems, to mention a few. These recent works describe not only the respective benefits of such systems for learning but also the potential benefits of the produced analytics, or multimodal learning analytics (MMLA), as the literature refers to them [21]. However, further research is needed to consolidate valuable information with respect to how multisensory interactions and the respective analytics can support the use and design of SLEs. Specifically, the proposed systematic literature review (SLR) investigates the following research questions (RQs).
1) RQ1: What stimuli are used to support learning in MSEs?
2) RQ2: What kind of learning analytics, and for what objectives, are employed in MSEs?
3) RQ3: What are the learning goals of multisensory interaction and LA in MSEs, and how do they relate to the learners' needs?
To address these three RQs, we conducted an SLR of MSEs and analyzed them based on the LA produced, the interactions and stimuli supported, the technological capabilities employed, the targeted learning goals, and the learner needs they attempted to address. Our decision to conduct a systematic review stems from the importance of quality evaluation in determining the rigor and relevance of the original research [22]. The findings reveal that the use of sensor data in MSE research goes beyond supporting and improving measurements such as learners' progress and behaviors; it also brings new interaction modalities and supports instruction for learners at different developmental stages. We analyzed the findings through the lens of SLEs and provide implications for the design and practice of future SLEs.

II. BACKGROUND

A. Smart Learning Environments' Affordances and Multisensory Interaction
SLEs sense information regarding the learning context by collecting various forms of data; these data endow the SLEs with qualities and attributes such as adaptability, flexibility, and thoughtfulness [3]. In particular, SLEs use this information for systems' enhancement and intelligence (e.g., through various algorithms), to directly support learning and instruction (e.g., through direct recommendations and dashboards), and/or to enable customizations and interventions that aim to improve learning and instruction [1].
MSEs [23] allow the user to interact with the system beyond the usual input devices (e.g., keyboard and mouse) and provide multisensory stimulations (e.g., through sounds, lights, and visuals) to support users' experience and task behavior. In the context of learning, MSEs focus on supporting learners' needs and progression, with early investigations identifying the benefits of multisensory interaction to support learning [24].
One of the theories underpinning MSEs is embodied cognition theory [25]. Specifically, embodied cognition theory describes how our bodies and environments relate to cognitive processes; recently, Skulmowski et al. [26] critically assessed current studies on the effectiveness of embodiment through a taxonomy intended to more accurately determine the possibilities, issues, and challenges of embodied learning research. The taxonomy's key features are bodily engagement (i.e., how much bodily action is involved) and task integration (i.e., whether bodily activities are meaningfully related to a learning task or not). In the same vein, Gelsomini et al. [19] summarized the main design aspects and attributes of MSEs to characterize technology for embodied learning by combining different recent taxonomies [27], [28], [29].
Although full-body interaction is still a novel concept in the context of learning, early studies indicate its promising potential. Malinverni et al. [27] gave a thorough overview of educational applications designed for full-body interaction, as well as the findings of their empirical evaluation and a summary of their potential. The study presented instruments to organize the many components of the design and assessment of full-body interaction learning environments, as well as advice for new research directions [27]. Another recent work, by Georgiou et al. [30], reviewed the scientific literature on students' learning outcomes in the cognitive, emotional, and psychomotor domains as a result of their exposure to technology-rich environments that were more embodied than other learning environments. Their research tackled the need for empirical data on the benefits of this new form of educational environment, as well as the necessity of gathering and synthesizing multimodal data. According to the review, the usage of technology-enhanced embodied learning settings in K-12 was connected to positive learning outcomes across the cognitive, emotional, and psychomotor domains, and embodied learning research appears to be largely focused on improving cognitive outcomes in STEM education.
Other studies have highlighted how multisensory stimulation can influence the human emotional state: for example, Schreuder et al. [31] developed a framework that depicts how different sensory modalities might influence our emotional, cognitive, and behavioral responses. They discovered that emotional reactions to an environment are context-dependent and are not simply dominated by one sensory modality or another. Furthermore, the interplay between many sensory stimuli may boost favorable emotional, cognitive, and behavioral responses, while incongruent multisensory inputs have a detrimental influence on higher-order responses but may improve memory [31]. Equipping SLEs with multisensory interaction can have implications for both pedagogy (e.g., enabling learners and teachers to use different interactions) and learning technology design (e.g., additional and richer services in the learning system). Therefore, despite the recognized promising potential of MSEs to support learning and augment SLEs, there is limited knowledge of which features and stimuli can be used and combined, and for which learning goals and needs.

B. Multisensory Environments and Multimodal Learning Analytics
The intersection between LA and MSEs has begun to take shape in the form of MMLA [32]. Inspired by microethnographic and interaction-analysis methodologies, MMLA aims to harness the power of sensor data and computational analysis to better understand and support student learning [8]. MMLA research focuses primarily on micro-level data and acts as a virtual observer and analyst of micro-level learning activities [21]. It involves emerging methods for collecting, making sense of, and utilizing multimodal data (e.g., visual, aural, gestural, spatial, and linguistic) on students' processes in online, physical, and blended learning spaces [33], [34].
The confluence of MSEs with MMLA offers novel opportunities and significantly contributes to our understanding of the potential of embodied learning environments through the analysis of a wide range of learner affective variables, such as concentration, fatigue, difficulty, interest, time pressure, and motivation; this results in meaningful findings in different domains [35]. Contemporary LA research shows how multimodal measurements such as performance scores, face and speech recognition systems, eye tracking, skeleton analytics, and wrist data may be applied to support learners [33]. Oviatt et al. [36] focused on the mental states of children during the learning process, which is a complicated activity that can be assessed only over time. MMLA is an emerging topic with significantly more complex but also thorough sensemaking and predictive powers than previous analytics, which were mainly centered on click-streams and text analysis of written input.
Crescenzi [37] offered recommendations for conducting MMLA research with children under the age of six to measure their task engagement, emotions, attention, comprehension, and goal attainment. Their findings demonstrated the difficulty of obtaining data with children of this age utilizing noninvasive approaches, and they discussed the ethical concerns of multimodal data derived from auditory, visual, biometric, and quantitative child behavior assessments. Recently, Sharma and Giannakos [33] analyzed the literature and presented the capabilities of multimodal data and the use of those capabilities to support learning and instruction. In addition, they discussed the consequences that arise from capturing and using multimodal data to improve learning. They also identified six key objectives that MMLA research has been addressing (behavioral trajectories, learning results, learning-task performance, teacher assistance, engagement, and student feedback) [33].
Therefore, MMLA research provides a rich basis on how "learning traces" from various sources can account for learners' cognitive, affective, and behavioral factors [18]. Nevertheless, the potential of MSEs to leverage multimodal capabilities and support learning (e.g., what stimuli can be used and combined and how, what LA can be produced, and how LA can be used) goes beyond the contemporary MMLA research [33], and the proposed study aims to shed light on this gap.

III. METHODS
To address the aforementioned RQs, we chose to conduct an SLR (following the guidelines by [38]), which is a transparent and widely accepted approach for minimizing potential researcher biases and supporting the reproducibility of results.

A. Review Planning
To the best of our knowledge, no previous systematic and comprehensive review of the available empirical work on the confluence of MSEs and their produced LA (i.e., what data come from MSEs and how they are used to support learning) exists in the literature. In particular, our focus in this SLR is to identify the role of the various stimuli/interaction modalities and analytics in the educational context (i.e., which stimuli and interaction modalities were used and how they supported learning). As a first step of the SLR process, we defined the databases used and the inclusion and exclusion criteria.

1) Scientific Databases Used:
Our search focused on the following seven academic databases: ACM DL, IEEE Xplore, SpringerLink, ScienceDirect, Wiley, SAGE, and Taylor & Francis. We selected these databases because they cover the major journal and conference papers published in the areas of educational technology and HCI (according to various lists, such as Google Scholar Metrics).
2) Inclusion and Exclusion Criteria: To find primary studies relevant to this SLR, we decided to include only empirical peer-reviewed works, although we took into consideration relevant nonempirical works in the related work and the discussion sections. We filtered primary studies acquired from the database search according to the selected inclusion and exclusion criteria indicated in Table I.
After filtering according to the selection criteria, we also analyzed each paper in relation to the quality criteria proposed by [39], which cover the three main issues of an SLR, namely, rigor, credibility, and relevance, and we discarded the articles that did not meet the following criteria.
1) Does the study clearly address the research problem?
2) Is there a clear statement of the aims of the research?
3) Is there an adequate description of the context in which the research was carried out?
4) Was the research design appropriate to address the aims of the research?
5) Does the study clearly describe the research methods (subjects, instruments, data collection, and data analysis)?
6) Was the data analysis sufficiently rigorous?
7) Is there a clear statement of findings?
8) Is the study of value for research or practice?

B. Search String Construction
Multisensory and analytics are the main terms that cover the relevant topics of research; because we are interested only in how these two terms are used in the context of interactive learning technologies, adding the terms "learning" and "interaction" allowed us to identify literature in our area of interest. For this reason, the following initial search string was formulated: "multisensory" AND "analytics" AND "learning" AND "interaction." However, some authors might use the following terms as synonyms: "multimodal" to convey a similar meaning to "multisensory"; "data" to convey a similar meaning to "analytics"; and "education" to convey a similar meaning to "learning." We therefore expanded the search string, and the final query was ("multisensory" OR "multimodal") AND ("learning" OR "education") AND ("analytics" OR "data") AND ("interaction").
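As an illustration of how such a query can be assembled, the following sketch (our own, not tooling used in the review) composes the final string from the synonym groups so that the same query can be pasted into each database's advanced-search form.

```python
# Illustrative only: build the review's boolean query from synonym groups.
SYNONYM_GROUPS = [
    ["multisensory", "multimodal"],
    ["learning", "education"],
    ["analytics", "data"],
    ["interaction"],
]

def build_query(groups):
    """AND together parenthesized OR-groups of quoted terms."""
    def or_group(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return " AND ".join(or_group(g) for g in groups)

print(build_query(SYNONYM_GROUPS))
# ("multisensory" OR "multimodal") AND ("learning" OR "education")
# AND ("analytics" OR "data") AND ("interaction")
```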

C. Systematic Review Process
After the search string was defined, the first step was to execute the search queries in the selected databases. We searched the titles, abstracts, and keywords of the articles in the included electronic databases with a temporal filter from 2010 to 2020 (the search was conducted in April 2021) since LA is a relatively new field that emerged in 2010 and we are interested in works at the intersection of LA and MSEs. From this first step of the research strategy, we obtained 2580 papers. Using inclusion and exclusion criteria, we reviewed the titles, abstracts, metadata, and keywords of all the studies that resulted from step one to determine their relevance to the SLR; this resulted in 161 remaining studies (see Appendix A).
In the third step, both authors independently assessed each of the 161 studies and critically appraised them according to the eight quality criteria mentioned above. Each of the eight criteria was assessed with a "yes" or "no" classification. This step returned 27 papers that we were confident could make a valuable contribution to this review. Then, we employed a reference analysis referred to in the literature as the "snowballing" method, which increases search efficiency [40]. Ultimately, a total of 33 papers constituted the paper corpus of the SLR; we thoroughly read, coded, and critically assessed these papers according to the review context of this systematic study.
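To make the appraisal step concrete, the following minimal sketch (hypothetical, assuming a study had to be judged "yes" on all eight criteria to be retained) encodes the per-study judgments and filters the corpus; the study identifiers are invented.

```python
# Hypothetical quality-screening sketch; assumes all eight criteria must hold.
from dataclasses import dataclass

@dataclass
class Appraisal:
    study_id: str
    criteria: list  # eight yes/no judgments, one per quality criterion

def passes_quality_screen(a: Appraisal) -> bool:
    assert len(a.criteria) == 8, "one judgment per criterion"
    return all(a.criteria)

appraisals = [
    Appraisal("S001", [True] * 8),
    Appraisal("S002", [True] * 7 + [False]),  # fails criterion 8
]
print([a.study_id for a in appraisals if passes_quality_screen(a)])  # ['S001']
```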

D. Data Coding
During the coding process, the authors extracted data based on 13 variables (units of analysis). This process was iterative, with frequent consensus discussions between the two coders (the authors of this article). The primary coder provided the initial coding, and the authors examined and agreed on the final codes, shown in Appendix B. Disagreements among coders and unclear elements of the evaluated papers were discussed and resolved. Although this process does not provide reliability indices (e.g., Cohen's kappa), it does provide a degree of reliability in terms of coding consistency and what Krippendorff [41, p. 278] defines as reliability: "the degree to which members of a designated community concur on the readings, interpretations, responses to or uses of given texts or data," an approach that is considered acceptable in HCI research [42]. The variables selected represent either an important methodological decision of the study (e.g., settings and methodology employed) or an important aspect of the objectives of this SLR (e.g., multisensory technology used and systems' objectives). These 13 selected variables, their descriptions, and their scoring criteria are shown in Appendix B. The results from the coding process are reported in Appendixes D and E.
In order to understand the capabilities of the MSEs described in the selected studies, we included additional variables that allowed us to analyze the form of the information exchange between the user and the system, as shown in Table II. In particular, the system-to-user category describes what prompt the system provides to the user, and the user-to-system category describes the stimuli of the interaction modality (i.e., how the user interacts with the system). Moreover, to investigate what data and analytics were used, in what form of LA metrics, and with what objectives, we further analyzed the selected studies, taking into consideration the objectives of the (sensor) data and the underlying objective of the system (at a meta-level).
Sensor data have the capacity to pursue three (sometimes complementary) objectives [43]; therefore, we classified the objectives of the sensor data following these three categories.
1) Contributing to rich measurements with respect to human learning (enriching studies, ES), by allowing us to gather insights that are often not considered in traditional LA research, such as eye activity, facial expressions, or gestures.
2) Contributing to the interaction affordances of learning systems (enriching interaction modality, IM), by utilizing sensor data to support richer communication (e.g., through gestures or gaze).
3) Contributing to the intelligence of learning systems (support functionalities, SF), by utilizing sensor data to support the learning system's affordances (e.g., affective systems and sensor-analytics dashboards).
Regarding the underlying objective of the system (at a meta-level), we followed the main categorization of LA objectives found in the literature [44], [45]. Therefore, we used the following three categories: 1) monitor learners' progress; 2) detect affects/emotions of learners; and 3) model learners' behaviors.
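For illustration only, the sketch below records these two coding dimensions per study; the category labels mirror the text above, while the study identifier and assignments are invented.

```python
# Illustrative encoding (ours, not from the reviewed papers) of the two
# coding dimensions: sensor-data objectives and system-level LA objectives.
from enum import Enum

class SensorDataObjective(Enum):
    ES = "enriching studies"               # richer measurements of learning
    IM = "enriching interaction modality"  # sensor-driven communication
    SF = "support functionalities"         # learning-system intelligence

class SystemObjective(Enum):
    MONITOR_PROGRESS = "monitor learners' progress"
    DETECT_AFFECT = "detect affects/emotions of learners"
    MODEL_BEHAVIOR = "model learners' behaviors"

# A study may pursue several (sometimes complementary) objectives at once.
study_coding = {
    "S017": {
        "sensor_data": {SensorDataObjective.IM, SensorDataObjective.ES},
        "system": {SystemObjective.MONITOR_PROGRESS},
    },
}
print(study_coding["S017"]["sensor_data"])
```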

IV. FINDINGS
First, to provide an overall picture of the primary research that has been conducted in the overlap of MSEs and LA, we provide a descriptive analysis of the coded variables.

A. Descriptive Analysis

1) Publication and Study Design:
The distribution of the selected studies according to the adopted research strategy is as follows: the majority of the papers were exploratory studies (n = 29 studies), followed by case studies (n = 4 studies). The dominant application field1 of the selected studies was therapeutic (n = 10 studies), followed by STEM (n = 7 studies), spatial learning (n = 6 studies), memory (n = 3 studies), social science (n = 3 studies), language learning (n = 3 studies), and humanities (n = 1 study).
1 The application domains are presented as they were identified by the authors. Regarding the terms "memory" and "therapeutic," which are not self-explanatory, we use "memory" to refer to short-term memory skill (based on the Cattell-Horn-Carroll classification of cognitive abilities [75]) and "therapeutic" to refer to mastering skills associated with therapy (with the use of standardized therapeutic tests).

2) Sample Population and Unit of Analysis:
The predominant sample population in the selected papers consisted of children with intellectual disabilities (n = 10 studies), followed by undergraduate students (n = 7 studies), primary school students (n = 6 studies), adults with intellectual disabilities (n = 4 studies), graduate students (n = 3 studies), caregivers (n = 2 studies), and high school students (n = 1 study), as shown in Fig. 1. Almost all the studies reported a sample size; sizes ranged from 5 to 500 learners, with a median of 40 and a mode of 15.

3) Methodology and Data Analysis Techniques:
The majority of the studies employed mixed-methods analysis (n = 23 studies), followed by quantitative analysis (n = 6 studies) and qualitative analysis (n = 4 studies). The findings show that mixed-methods analysis is the dominant methodology in the intersection of MSE and LA research. This demonstrates that most analyses combined quantitative data gathered via computer logs, self-reports, and sensors with qualitative data collected by methods such as interviews and observations. In particular, score logs and questionnaires are the dominant forms of data collection, with video recordings, interviews, and multimodal data also being frequently used (see Table III).

4) Technology Setting of the Learning Environment:
Within the setting of the learning environment, we categorized the technology and tools used in the studies (Appendix C). All the technologies analyzed were able to provide multisensory interaction within the combination of stimuli described in Table II. Most of the studies (n = 25) employed MSEs that use motion-based depth cameras (e.g., Microsoft Kinect or Intel RealSense), followed by traditional tools such as laptops, tabletops, and keyboards (n = 5), extended reality technologies that include virtual reality (VR) and augmented reality (AR) (n = 4), and web tools such as chat and social media (n = 1). Motion-based cameras were heavily used in the articles included in this literature review. This might be explained by their enabling of body movements, a high degree of immersion in the learning experience, and a variety of stimuli that allow users to interact through various modalities. In one example given by Gelsomini et al. [19], the motion-based system integrates visual contents projected on the walls and the floor with synchronized colored lights, as shown in Fig. 2; it offers multimodal interactions with at least two simultaneous activations of seeing, hearing, touching, envisioning, or motor movements.
Fig. 2. Motion-based multisensory system [19].
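As a hedged illustration of how such motion-based systems turn depth-camera output into interaction events, the sketch below maps skeleton joints to a simple mid-air gesture; the joint names, coordinate convention, and threshold are our assumptions, not the API of any reviewed system.

```python
# Illustrative sketch: from raw skeleton joints to a gesture event.
from dataclasses import dataclass

@dataclass
class Joint:
    x: float
    y: float
    z: float  # meters, camera space; y grows upward (assumed convention)

def detect_raised_hand(joints: dict, margin: float = 0.15) -> bool:
    """Fire a 'hand raised' event when the hand is clearly above the head,
    a common trigger in motion-based quiz games."""
    return joints["hand_right"].y > joints["head"].y + margin

frame = {"head": Joint(0.0, 1.6, 2.0), "hand_right": Joint(0.2, 1.8, 2.0)}
if detect_raised_hand(frame):
    print("answer selected")  # e.g., map the gesture to a quiz response
```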
5) Research Objective: After coding and categorizing the research objectives of the papers, we determined that the literature on the intersection of MSEs and LA focuses on the following research objectives.
1) Discuss the efficacy of MSEs versus conventional tools, which comprises studies that examined MSEs and compared them to traditional learning technologies (n = 14 studies).
2) Investigate multisensory and spatial perception, which includes works that focus on a deeper understanding of meaning-making in multisensory and embodied learning situations (n = 11 studies).
3) Explore cognition development, which consists of articles that explore whether or not multisensory stimulation promotes autonomy and complements therapy in such cases (n = 5 studies).
4) Support a ludic context between teachers and learners, which requires investigation of the role of caregivers in the multisensory learning experience (n = 3 studies).

B. What Stimuli are Used to Support Learning in MSEs? (RQ1)
The identified MSEs employ mainly motion-based games. Such games (usually quiz games) enable children to practice and learn conceptual knowledge using natural interaction. The motion-based depth camera is a simple but very efficient solution that allows a system to recognize hand and body movements and gestures [13], [14], [60]. More complex but also more immersive environments are also used in this category. For instance, these include dedicated rooms that integrate digital worlds projected on the wall and the floor with a wide variety of smart physical objects (toys, ambient lights, materials, and various connected appliances) to enable embodied interaction [51], [63], [65]. In Table II, we describe the stimuli provided by the MSEs and the users' interactions.
All the identified systems provide visual stimuli, usually conveyed through images and written texts [49], [52], [60]. Some technologies also support video and lighting [19], [71]; for example, Gelsomini et al. [48] developed a collaborative game with portable lights in which children learned colors by associating them with a particular element. Most of the technologies are also equipped with auditory stimuli, such as sounds or music, which can also serve as feedback (correct or incorrect sounds) or even as distractions (e.g., audio distractors to increase the difficulty level [13]). The audio in these systems can also be a recorded human voice or synthesized speech, which typically serves to instruct the user [58], [60]. Hence, motion controllers with projections and speakers are the dominant form of technology in MSEs for learning and can provide many different stimuli (e.g., visual, auditory, and olfactory). At the same time, immersive environments are also employed to support smart physicality.
The olfactory stimulus was present in eight studies, usually through a spray device that can emit various aromas. One study [53] investigated how to encourage design thinking and decision-making by exploring how smell can be used as a novel interaction modality, using an application that had both desktop and VR implementations. Moreover, Merlt et al. [47] studied how interactive multimodal devices that include olfactory stimulation can prompt communication among a daycare clinic's guests suffering from dementia, as well as how much they can improve guests' moods by evoking memories and emotions.
Lastly, physical and exteroception stimuli were present in a few studies [49], [51]. In these settings, information is conveyed by the technology through a deliberate vibration or movement, either physical (localized in a specific area, such as an object) or environmental. For example, in one study, plantar vibration was utilized in conjunction with auditory and visual modalities in order to improve learners' self-motion perception [57]. The exteroception stimulus is the most rarely used; it occurs in an MSE called Snoezelen [62], which provides a massive amount of sensory stimulation, including environmental stimuli. According to the findings of this study, patients with dementia spoke more spontaneously, had increased awareness of their surroundings, and were more active after the MSE stimulations. Therefore, the input provided plays an important role in the learning activity, even if it is not directly connected with cognitive gains. Moreover, due to the increased complexity of multimodal interaction (compared to self-explanatory buttons or buttons that users are accustomed to), it is important for MSEs to scaffold interactions that the learner can easily understand, as well as to provide stimuli that can be interpreted in a meaningful way (e.g., intuitive use of the interaction and predictable system behavior).
Interaction with the technology was related mostly to learner movements; in most of the articles analyzed, the usual interaction was spatial, including both full-body (such as walking, crawling, and moving arms [14], [59], [61]) and gesture-based (particular body limbs [46], [50]) interactions. Yap et al. [56] designed a game in which children must use their body gestures to form the shapes of letters to spell a word, which provides the child with flexible play while delivering educational content. In some cases, users also interacted physically with objects [65] or small handheld toys that provided sensory tactile stimulation when activated by a child [51]. In one study, children were able to physically represent the concepts of predator-prey learning via bimanual gestures that represent inverse relationships [54]. The majority of motion-camera systems are also equipped with microphones (audio streams) to capture audio information [59]; however, none of the articles we analyzed supported verbal interaction with the technology. Hence, contemporary MSEs' capabilities allow for efficient and smooth motion-based interactions that can also be integrated with tactile interaction with physical objects. Nevertheless, there is a lack of MSEs utilizing verbal interaction.

C. What Kind of Learning Analytics, and for What Objectives, are Employed in MSEs? (RQ2)
We examined the LA, their objectives, and the data used in the studies in each of the three categories (Fig. 3). We also investigated whether and which sensor-based analytics were present, as well as their underlying purposes (see Appendix F).
We discovered that the studies employed commonly used LA measures, such as scores and logs, to both monitor and identify learners' affects and emotions. Moreover, quantifications from self-reports, observations, interviews, and psychophysiological data were also employed. For example, Malinverni et al. [65] focused on meaning-making in embodied learning experiences to inspire design improvements and integrate user contributions from a perspective that goes beyond the limitations of spoken language. To do so, they measured users' positions, movements, paths, gazes, pauses, and relative speeds during interaction with the system. The sensor-based analytics used in this study were meaningful both to advance the measurements (e.g., associated with learners' skeletons and movements) and also to provide a new type of interaction. In particular, learners could explore and interact with the environment using a butterfly net that allowed them to open peepholes in the fog and discover what was hidden underneath, thanks to the camera sensor, but at the same time, the respective movement analytics were gathered to make sense of the behavior and progress of the learner. Therefore, the measures used to evaluate learner behavior and progress are commonly combined in order to produce more relevant findings and to track what is not visible to the human eye.
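The following minimal sketch illustrates this style of movement analytics, deriving path length, average speed, and pause count from timestamped floor positions; the sampling format and pause threshold are assumptions for illustration, not the instrumentation of [65].

```python
# Illustrative movement metrics from a stream of (t_seconds, x_m, y_m) samples.
import math

def movement_metrics(track, pause_speed=0.05):
    """track: time-ordered list of (t, x, y); returns simple mobility metrics."""
    path, pauses = 0.0, 0
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        path += step
        if step / (t1 - t0) < pause_speed:  # nearly stationary sample
            pauses += 1
    duration = track[-1][0] - track[0][0]
    return {"path_m": path, "avg_speed_mps": path / duration, "pauses": pauses}

print(movement_metrics([(0, 0.0, 0.0), (1, 0.5, 0.0), (2, 0.5, 0.02), (3, 1.0, 0.5)]))
```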
Several studies used sensor-based analytics to enrich the interaction modality, but those analytics were not used further (e.g., to detect learners' progress or behavior). In particular, most of the studies utilized (but did not collect) sensor data from the users (how users move and their skeleton points), with the systems processing these data "on-the-fly" to enable motionbased interaction and mid-air gestures, resulting in increased immersion [48], [60], [63]. The majority of the articles used self-reports with Likert scales, coding videos, and focus groups with experts and instructors to detect learners' behavior and emotions during the learning process. Therefore, MSEs make use of sensor data primarily to enable new modes of interaction (e.g., with the use of learners' bodies) that result in better immersion and engagement with the system. However, contemporary MSEs do not consider those data as LA, as their purpose was not to either understand or optimize learning. Nevertheless, we observe that in these studies, the researchers made heavy use of different tools to evaluate students' learning (e.g., posttask surveys and observational rubrics).
In some studies, detecting learners' progress was the main objective of LA. The measurement of learners' progress was often implemented by a form of LA that consisted of both system logs and sensor-based data. For instance, Andrade et al. [54] looked at how hand movements and gaze direction enabled an understanding of how elementary students explore feedback loops while directing an embodied simulation of a predator-prey ecosystem with hand gestures. The study's findings revealed five distinct motion sequences in students' embodied interactions, which were statistically linked to students' initial and posttutorial degrees of loop understanding. Thus, the collection of MMLA can provide a richer understanding of the multidimensional relationship between the interaction (e.g., movement and gestures) and the learning outcome (e.g., cognitive gains).
Following the ambition to objectively quantify (or just proxy) learning, one study proposed that data related to users' psychophysiology combined with self-reports and objective learning outcome assessments could be used to accurately infer the actual cognitive processes that occur during the learning activity and knowledge acquisition. Baceviciute et al. [61] evaluated a VR application in which the researchers gave users the same educational content in three distinct modalities: text in an overlay interface, text embedded semantically in a virtual book, and audio. EEG analyses revealed significantly reduced mental processing when learning through auditory representations in VR. Additionally, from self-reports, they found that embedding textual information semantically in the virtual environment increased learners' self-efficacy (i.e., confidence in learning) and reduced the perceived cognitive load caused by the learning material and its design. In this case, the sensor-based analytics were useful to gather meaningful insights, which emphasizes the need to employ a plurality of data-driven methodologies when investigating cognitive processes, especially during complex activities such as learning in immersive technology. Therefore, exploiting the possibility of combining learners' sensor-based indicators with mainstream data acquired during the learning experience (e.g., logs, artifact analysis, and self-reports), allows us to gain insights into different facets of learners' progress, mastery, and experience.

D. What are the Learning Goals of Multisensory Interaction and LA in MSEs, and How Do They Relate to the Learners' Needs? (RQ3)
1) Additional Coding to Address RQ3: To address RQ3, which focuses on the learning goals intended from MSEs' interactions and the respective LA, we first identified and classified the learning goals of the studies, as shown in Fig. 4. Then, we investigated how the goals are associated with the respective context (i.e., sample population needs and technology) and the LA employed.
Through examining the learning goals of the articles analyzed, we classified and grouped them into three main categories (see Fig. 4).
1) Cognition: Involves mental processes that could affect every aspect of life (memory, design thinking, decision-making, spatial learning, etc.).
2) Didactic: Includes skills that are usually taught using a didactic method (reading, writing, listening, etc.).
3) Socialization (inclusion): Comprises the abilities needed to communicate and socialize with others.
Most of the articles analyzed concentrated on didactic (n = 15) and cognition (n = 12) goals, and only one focused solely on inclusion. Moreover, three combined inclusion with cognition, one didactic with cognition, and one didactic with inclusion.

2) Motion-Based Affordances and Analytics to Account for the Embodied and Multimodal Nature of Learning: Fig. 5 depicts the relationships among the sample population, the employed technology, and the intended learning goals. The findings suggest that technologies utilizing motion data are the most popular. This can be explained by the fact that motion-based affordances (through both interactions and analytics) account for embodied cognition. Embodied cognition is achieved through established gesture- and skeleton-modeling techniques that allow us to capture and further utilize learners' movements and interactions across contexts (e.g., [76]). Moreover, motion-based technologies appear to be very popular among children with intellectual disabilities and primary school students (at developmental stages at which children's limited cognitive and motor abilities can be supported with the use of motion interactions). However, when learners' ages increase and their abilities evolve, such as in the formal operational stage (above 12 years old) at which children begin to use deductive logic and reasoning, new technologies and stimuli with more complex affordances are used (e.g., AR and VR).
3) Learning Goals Through the Lens of Learners' Roles and Abilities: Below, we describe the learning goals for each type of population examined in the selected papers, underlining the technologies used and how the LA contributed to a better understanding and improvement of the studies presented.
Articles that studied children with intellectual disabilities worked mostly to improve the cognitive skills of the subjects through therapy interventions [51], [60], targeting memory skills and emotions [13], but also to increase collaboration and inclusion among peers [48], [65]. One of the papers included academic performance improvements (mostly in mathematics) [14]. All studies utilized motion-based technology for the potential benefits of embodied learning experiences and improvements in the children's cognitive skills, motor skills, and academic performance. The use of motion-based technology allowed for the collection of rich data through motion cameras (skeleton data), and the system logs gave access to the children's learning processes. Given the vulnerability of this population, observations and videos were the predominant forms of data used to understand their diverse learning needs [63], [64]. For instance, in order to detect potential usability concerns and assess children's attitudes, Malinverni et al. [52] combined the knowledge of therapists and researchers with field observations and note-taking of interactions. In particular, they recorded all sessions and conducted an in-depth analysis of the data while taking notes in accordance with a coding scheme checklist designed to identify the presence of specific behaviors. Therefore, the main objective of LA in MSEs for children with intellectual disabilities is to support their cognitive skills and employ multisensory capabilities to overcome interaction and cognitive challenges.
Studies with primary school children as a sample population concentrated more on didactic performance improvements, in particular for STEM subjects [54], as well as improved language mastery (spelling words of different lengths or learning the alphabet) [55], [56], [72]. In these cases too, the studies predominantly used motion-based systems. One of the studies concentrated on both the didactic performance and the inclusion of children playing with the MSE [19], stating that children with more learning difficulties found memorization easier within the MSE. This success may be attributed to the variety of senses and interaction modes used, which made it easier for individuals to select the learning modality that suited their natural tendencies. The LA used with this sample population included not only skeleton data and system logs but also gaze postanalysis [54] to identify patterns in the variability of gaze, which enabled interpretation of the different ways students paid attention to the learning contents. Thus, in the context of MSEs, LA can aid researchers studying primary school-aged children in understanding the aspects (e.g., content and technology) to which children paid closer attention, which will help in designing more successful motion-based technologies for the intended users.
When the age of the users increased to high school, undergraduate, and graduate students, the technology ranged from motion-based systems to more traditional tools (laptop- or tablet-based) [23], [67]. Moreover, we see other advanced technologies, such as VR and AR [57], [58], [61], that require a good level of cognitive and motor abilities from the learner. The main learning goals were related to improving specific didactic subjects, such as geometry [59], physics [70], and mathematics [69], and also to improving design thinking, memory, and spatial learning [53], [57], [74]. Because these participants' ages allowed them to better understand their abilities and assess the technology employed, scores and surveys stand out as more significant in these circumstances than with other types of learners [58], [59], [70]. Junokas et al. [69] provided participants with a qualitative feedback form to record their own thoughts about the gestures, their experiences with the challenges, and the ease of use; participants also had access to an open section in which they could express their opinions about the encounter. Test scores were often used to assess learning, and Magana et al. [70] specifically analyzed students' responses to the pretest and posttest, scoring them on a 0-1 scale (incorrect and correct answers, respectively) and examining the collected data using both descriptive and inferential statistics. Hence, as learners' cognitive and motor abilities increase, there is a shift from motion-based technologies to more traditional interaction modalities and inputs. Learners' development plays an important role in the selected content (e.g., in high school, we see a focus on STEM) and in the data collection methods (e.g., surveys and scores).
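To illustrate this pretest/posttest analysis style, the sketch below applies descriptive statistics and a paired t-test to per-student totals; the scores are invented, and the procedure is a generic stand-in rather than a reconstruction of [70].

```python
# Hypothetical pre/post analysis: items scored 0/1 and summed per student.
from statistics import mean
from scipy.stats import ttest_rel  # inferential step (paired-samples t-test)

pre = [3, 4, 2, 5, 3, 4, 2, 3]    # pretest totals for eight students
post = [5, 6, 4, 7, 4, 6, 3, 5]   # posttest totals for the same students

print(f"pre mean = {mean(pre):.2f}, post mean = {mean(post):.2f}")
t, p = ttest_rel(post, pre)
print(f"paired t = {t:.2f}, p = {p:.4f}")  # gain significant if p < .05
```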
Most of the studies that involved adults with intellectual disabilities used motion-based systems. All the studies aimed to improve the cognition skills and the autonomy of the participants involved, for instance, to treat their dementia [47], [62] or encourage self-efficacy [46]. Moreover, a combination of audio/visual and olfactory stimuli was used to improve participants' memory [49]. Some studies also used physical stimuli with environmental vibrations [62]. User interaction was passive and static; in fact, with this population, most of the systems involved physical touch interaction with objects [47] rather than motion-based interaction. To understand how the participants' learning increased (memory or mathematical abilities), pretests and posttests were crucial complements to the LA. For example, the digit span memory test and the Romberg balance test were used by Toro [49] to assess memory and standing balance in individuals with moderate learning impairments following multimodal stimulation in the MSE. Moreover, the caregivers' perspective was shown to be significant in enriching the LA; caregivers could comprehend the nuances and consequences of the participants' actions and responses to the learning tasks. Mased et al. [62] asked therapists to rate the participants' mood and behavior before, during, and after the MSE sessions using the INTERACT scale. The authors also gathered data on two biomedical parameters, heart rate (beats per minute) and SpO2, immediately before and after sessions in the MSE using mobile finger pulse oximeters. Consequently, we can say that to support adult learners, especially those with special abilities, motion-based systems with combined stimuli can be used to support memory and cognitive development. Additionally, we see that integrating LA such as pretests and posttests with caregiver ratings and observations improves understanding of the learning process.

V. DISCUSSION AND FUTURE DIRECTIONS
Multisensory and analytics capabilities in SLEs can help to create new methods of engaging learners by increasing implicit awareness and allowing data-driven teaching and learning (e.g., supporting the learning design and real-time decision-making and interventions). Despite the rise of MMLA research as a core LA topic, its application in SLEs is currently limited and requires further consideration. Although potentially useful multimodal data are captured by SLEs (e.g., to power multimodal interaction), those data have been used only to a limited extent to either understand or optimize learning. Our findings point out how beneficial it could be to adapt the combination of stimuli and interactions to different users and learning goals by leveraging the collected MMLA. The results of the SLR are discussed in the following sections, with an emphasis on potential implications and future directions in the field of SLEs.

A. Multisensory Stimuli to Support Learner Needs
Based on our findings, motion controllers and immersive environments, which can deliver a variety of stimuli, are the predominant technological form that supports "multimodality" in MSEs for learning. MSEs enable sensory experiences and strengthen the central nervous system's ability to acquire and integrate sensory information [77]. Learning is heavily dependent on students' capacity to acquire information, process it, and integrate it into a well-planned and ordered activity [78].
Research has previously explored MSEs' capabilities, showing that motion capabilities can facilitate seamless communication between learners' body motions and gestures (e.g., clicking, grasping, pointing, walking, or balancing) and learning systems [51], [65]. Multisensory interaction allows learners to use their full potential through a more natural interaction that offloads users' cognition (e.g., reducing cognitive friction and supporting difficult mental operations). This is especially beneficial for learners in developmental stages at which they have limited cognitive processing ability (e.g., young students and both children and adults with intellectual disabilities). From our analysis, we identified that the article corpus pays particular attention to the unique needs of specific groups of learners (e.g., of a specific developmental age or with special abilities), with MSEs applying a variety of stimuli to address different learning goals [52], [62]. Going beyond the widely used stimuli of spatial interaction, we see that combined inputs, such as visual/audio, olfactory, and tactile or haptic interactions, have the capacity to improve learners' autonomy and cognitive development, as well as their working memory [47], [62]. The chosen publications, however, did not indicate which particular stimulus combination or interaction is more efficient for a certain learning goal. The first challenge raised by our findings is therefore the lack of knowledge about how different combinations of stimuli/interactions affect the learning experience of the chosen sample group.
Moreover, although most of the technological apparatuses of the studies included the equipment needed to support vocal interaction (e.g., microphones and audio streams), and despite the rise of speech interfaces [79], we could not find any article using vocal interaction. This may be because robust voice recognition and processing are required to properly realize natural language understanding [80], and because of the challenges of using those interfaces with young children [79]. Verbal communication is the core of relationships and is essential for learning, fun, and social interaction [81], and many researchers have started to investigate how vocal interaction could support the teaching and learning process [20], [82], [83]. However, the majority of prior research that used vocal interaction involved technologies such as conversational user interfaces (web pages or applications) [84] or mainstream conversational agents (e.g., Apple's Siri, Amazon's Alexa, and Google Assistant) [85]. The latter provide basic instruction and are far from supporting the conversation and inquiry needed for learning (e.g., dialogic learning). Although vocal interaction is a promising direction for MSEs for learning, there is a lack of research due to several technical and pragmatic challenges (e.g., learners who are linguistically and culturally diverse) [86]. Therefore, more research is needed to determine how to design MSEs that feature vocal interaction, as well as how its combination with the stimuli we discussed may influence the learning process.
From the SLR, we also identified that AR and VR systems requiring advanced navigation and spatial abilities are less used with, and unsuitable for, young learners.
In general, numerous studies have suggested that such technology has the potential to be used as a pedagogical tool and an immersive space for learning [87] that provides learners with an authentic context in which they can expand their learning scope, visualize situations and concepts that are difficult to depict using other mediums, and gain more meaningful knowledge [88], [89]. Although some studies provide evidence of the usefulness of adopting VR technology [90], [91], these studies are still limited in number and provide customized solutions to support specific learning needs (e.g., training spatial skills).

B. Potential of MMLA in MSEs
Technologies that include movement and other motion capabilities provide rich contextual information, enabling MSEs to utilize a wider range of data than traditional learning systems [92]. Gathering rich data from the learning environment itself and from the interaction with the learner (position, movements and paths, gaze, pauses, psychophysiology, etc.) brings a number of promises and challenges to learning technology research (ranging from ethical and practical to methodological issues; see [18]). Our findings suggest that sensors are frequently used in research to enhance the interaction modality. In these cases, the learners' behavior was detected mostly through the analysis of interviews or surveys or by coding videos and observations [60], [63]. The integration, aggregation, and harmonization of learning-related data from many sources and devices have the potential to provide rich(er) evidence-driven design that can help people learn more effectively [93], for example, by achieving more advanced functionalities, such as those of affective systems. Tracking what is not visible to the human eye via sensors allows us to construct metrics and analytics (combined MMLA) that account for learners' behavior and development [54], [61], [65]. Therefore, another challenge is the need to investigate the relationship between LA (coming from both sensor-based and mainstream data) and learning, and how these data might be used to assist learners in reaching their learning goals. Analytics from these data could provide insights that enable fine-grained and timely data-driven decision-making and support [33]. Our results show that the data collection methods and LA used also changed according to the sample population and learning goals. This learner-centered approach raises further discussion of how LA and MMLA might aid in the creation of customized multisensory learning experiences. Although such data can enable the use of automated feedback (e.g., error correction, hints, or feedback provision) [94] and adaptations to support effective combinations of stimuli and interaction [95], [96], [97], [98], this potential remains unexplored in MSEs.
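As one deliberately simplified example of such integration, the sketch below aligns two differently sampled streams, say skeleton-derived activity labels and heart-rate samples, onto shared timestamps before any combined metric is computed; the data, pairing rule, and tolerance are our assumptions.

```python
# Illustrative multimodal alignment: nearest-in-time pairing of two streams.
def align(stream_a, stream_b, tolerance=0.5):
    """Each stream is a time-sorted list of (t_seconds, value). Pair every
    sample in stream_a with stream_b's nearest-in-time sample, dropping
    pairs further apart than `tolerance` seconds."""
    pairs, j = [], 0
    for t, a in stream_a:
        while j + 1 < len(stream_b) and \
                abs(stream_b[j + 1][0] - t) <= abs(stream_b[j][0] - t):
            j += 1
        tb, b = stream_b[j]
        if abs(tb - t) <= tolerance:
            pairs.append((t, a, b))
    return pairs

skeleton = [(0.0, "still"), (1.0, "walk"), (2.0, "gesture")]
heart = [(0.2, 72), (1.1, 80), (2.4, 95)]
print(align(skeleton, heart))
# [(0.0, 'still', 72), (1.0, 'walk', 80), (2.0, 'gesture', 95)]
```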

C. Important Role of the Secondary User in Multisensory Environments
From the SLR, it is evident that a secondary "user" (e.g., instructors and teachers) is present during the learning experience, mostly as an orchestrator or an observer [52], [62]. As a result, it is critical to begin incorporating the secondary user into the design of MSEs (e.g., specific interfaces for adjusting content and difficulty), as well as into the collection of MMLA (e.g., analytics and dashboards). This aligns with recent developments in the area of human-centered learning analytics (HCLA), which highlights the potential of utilizing knowledge and practice from the design communities (e.g., participatory design and codesign) to support the design of LA [99]. This will allow us to engage with LA design processes that take into account the needs and challenges of educators (e.g., what LA metrics are helpful for them, and how and when they would like this information to be presented) to help them enhance their teaching capabilities.
Given the complexity of these technologies, the role of the secondary user is crucial in supporting the systems in employing functionalities such as customization (e.g., secondary-user input and manual setting of difficulty) and supportive pedagogy (e.g., giving hints and the support needed to prevent the user from dropping out). Several researchers have begun to look into this topic, and they have included the figure of the caregiver in both the preliminary and postanalysis stages of designing MSEs for learning [50]. Following the roles of participants as defined by Iversen et al. [100], we use the principle of primary and secondary personas in interaction design [101]. In this review, we found that 17 out of the 33 articles included the figure of a teacher or a therapist as a secondary user, depending on the sample population, which mainly included adults with intellectual disabilities, children with intellectual disabilities, and primary school children as primary users. The role of the educators was mostly as active coresearchers; they were involved in the studies not only as observers or interviewees but also as a crucial part of the learning experience, orchestrating the learners' interactions.
In numerous examples (e.g., learners with special abilities), excluding the secondary user makes it difficult for the LA alone to convey an accurate image (and provide insights that can support the learning process). On the other hand, the use of extensive sensor data might help teachers monitor their students' learning progress [102], [103]. Viewing the collected data on supportive tools, such as LA dashboards, affords teachers new insights, which help them make informed data-driven learning decisions by providing formative and summative feedback; however, additional effort is needed to ensure that the produced LA are effectively designed [104], [105], [106]. Chatti et al. [44] have already discussed the role of both students and teachers in their framework, and involving them in the design process has the potential to increase the acceptance and adoption of LA solutions. Moreover, taking into consideration recent works in HCLA [99] and the incorporation of artificial intelligence (AI) capabilities into LA, new avenues are opening up regarding how AI may supplement educators' tasks while employing MSEs (e.g., hint or feedback provision). These directions are a stepping stone for investigating how such "mixed-initiative systems" can empower teachers by combining human and AI/LA capabilities [107], [108].
Therefore, the design and practice of these systems to support learning, and their respective LA architectures, need to consider the secondary user and provide appropriate functionalities, analytics, and best practices that empower this user to play an active role.
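As a concrete illustration of the teacher-facing analytics discussed above, the sketch below reduces a session's interaction log to a few summary indicators that a dashboard might surface to a secondary user. The event schema, thresholds, and the crude "flag for support" rule are hypothetical choices for illustration only.

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    learner_id: str
    correct: bool        # whether the learner's response was correct
    latency_s: float     # response time in seconds

def dashboard_summary(events, slow_threshold_s: float = 10.0):
    """Compute per-learner indicators a teacher dashboard might display."""
    by_learner = {}
    for e in events:
        s = by_learner.setdefault(e.learner_id, {"n": 0, "correct": 0, "slow": 0})
        s["n"] += 1
        s["correct"] += e.correct                     # bool counts as 0/1
        s["slow"] += e.latency_s > slow_threshold_s   # count unusually slow responses
    return {
        learner: {
            "accuracy": s["correct"] / s["n"],
            "share_slow_responses": s["slow"] / s["n"],
            # A deliberately crude formative flag; real systems would use
            # richer, pedagogically grounded rules.
            "flag_for_support": s["correct"] / s["n"] < 0.5,
        }
        for learner, s in by_learner.items()
    }

events = [
    InteractionEvent("A", True, 4.2), InteractionEvent("A", False, 12.5),
    InteractionEvent("B", False, 15.0), InteractionEvent("B", False, 9.1),
]
print(dashboard_summary(events))
```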

D. Implications for Design and Practice
In this section, we discuss the implications of this work and pave the way for creating a representation of how SLEs can leverage multisensory interactions and LA to support learning. The SLR highlighted how interaction modalities and LA are associated with each other, as well as how they relate to the learners' needs and the intended learning goals. The role of learners (and instructors), as well as the intended learning objectives, is crucial in determining which stimuli and interactions to employ. This sets the ground for what data to capture and how to use them for the monitoring or enhancement of SLEs. Based on the findings of our research, we introduce a representation of the main components and their association with the design of the learning process. This representation can serve as a springboard toward understanding the main components involved (e.g., multisensory stimuli and interaction, and LA) and enhancing them to support the design of future SLEs.
Previous works have outlined the advantages of SLEs in terms of identifying which technologies are employed to aid students (or instructors) in performing learning (or teaching) activities, as well as synthesizing the primary components and essential technological functionalities [1]. These works place great emphasis on systems' intelligence in gathering input from the learning context (sensing), decoding and processing it, and logically recommending actions to alleviate learning restrictions and enhance learning performance (analyzing and reacting). Mangaroska and Giannakos [109] highlighted the importance of evidence-driven learning design and suggested a taxonomy that illustrates the synergy between LA and learning design (i.e., how LA can support the learning design). In the context of SLEs and MSEs, the potential of evidence-driven learning design is even greater. This is due to the potential of MSEs to activate and scaffold different communication and interaction modalities, as well as the potential of the produced MMLA to provide granular and timely insights [33]. Moreover, such multimodal and rich data can provide novel affordances that enhance learning (e.g., affective learning [110] and embodied learning [8]).
Taking into account the works described above, we illustrate an iterative process (as shown in Fig. 6) comprising SLEs' design factors, SLEs' experience flow, and postdata analysis that may inform future design factors.
The first objective of the representation is to list the major design factors to consider when designing SLEs that include multisensory interaction and LA.
1) Learning actors: The learning actors involved are the end-learners, considered the primary users, and their instructors (e.g., teachers and parents). The learning actors are capable of taking action to interact with the space. Multisensory SLEs allow the various learning actors to receive or submit different prompts (or stimuli) and act on the related context and space.
2) Learning goals: The learning goals involve the educational objectives and the teaching-learning approaches. In the context of multisensory SLEs, the functionalities of the SLEs themselves influence the learning goals, as does the context of use (e.g., level of autonomy and scaffolding).
3) Technology: This is the combination of digital and physical technological affordances that enhance learning and teaching. The categories of technologies employed include those that support learners and those that assist instructors in making in situ decisions, as well as in the postanalysis of the learning experience (e.g., sensors and supportive tools).
4) Space: The physical and virtual (or mixed) settings that orchestrate several technologies and allow the multisensory SLEs to perform their functionalities (see [1]).
5) Multisensory stimuli: Table II describes the stimuli that the space provides to the learner (space prompts); as we highlight in our findings, they can be combined according to the learning goals and the needs of the users. In the same vein, interaction modalities allow the learner to interact with the MSEs through different senses.
6) Learning analytics: Paraphrasing SoLAR's definition, learning analytics refers to the measurement, collection, analysis, and reporting of multisensory data about learners and their contexts for the purposes of understanding and optimizing learning (e.g., empowering learning to be made more effective, efficient, and engaging with the support of smart environments).
The second objective is to show how these factors cooperate to design a meaningful SLE experience. As we already discussed, the needs of the learning actors play an important role in determining the learning goals. This, in turn, influences the choice of the multisensory stimuli combination offered by the space and the associated technology (e.g., sensors). The stimuli produced might affect the learning success. As researchers, we need to investigate how the combination of stimuli (both space prompts and interaction modalities) activates learners' senses and its relevance to the stated learning goals. To do so, the selection of the LA collected during the interaction with the SLE is crucial in the facilitation of the intervention. LA techniques are, in fact, a necessary tool for SLEs to build interventions that provide actionable insights for both instructors and students. In particular, the space can react based on the interpretation of the processed MMLA, providing tailored information to learners via space prompts or automatic feedback; a minimal sketch of such a reactive step is given below. Moreover, suggestions could be sent to instructors via additional supportive tools (such as dashboards) that allow them to make in situ decisions and give instructions. Lastly, we underline the importance of the postdata analysis as a necessary step for the instructors and researchers to understand learners' misconceptions and experiences based on the data collected. This step closes the loop of the iterative process to design future SLE experiences that leverage multisensory interactions and LA.
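To make the reactive step of the experience flow tangible, here is a minimal sketch of how interpreted MMLA indicators could be routed either to the learner (as a space prompt or automatic feedback) or to the instructor (as a dashboard suggestion). The indicator names, scales, and thresholds are assumptions made for illustration, not prescriptions derived from the reviewed studies.

```python
from typing import Dict, List, Tuple

def decide_interventions(indicators: Dict[str, float]) -> List[Tuple[str, str]]:
    """Map MMLA-derived indicators to (channel, message) interventions.

    indicators: e.g., {"error_rate": 0.6, "engagement": 0.2}, on 0-1 scales.
    Channels: "space_prompt" reaches the learner; "dashboard" reaches the instructor.
    """
    interventions = []
    if indicators.get("error_rate", 0.0) > 0.5:
        # Immediate automatic feedback to the learner via the environment.
        interventions.append(("space_prompt", "Offer a hint and simplify the next task."))
    if indicators.get("engagement", 1.0) < 0.3:
        # In situ suggestion routed to the instructor rather than the learner.
        interventions.append(("dashboard", "Engagement is low; consider changing stimuli."))
    if not interventions:
        interventions.append(("dashboard", "Learner on track; no action needed."))
    return interventions

for channel, message in decide_interventions({"error_rate": 0.6, "engagement": 0.2}):
    print(f"[{channel}] {message}")
```

A deployed SLE would replace these hand-written rules with models learned from the postdata analysis, which is precisely how the iterative loop described above feeds back into design.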

E. Ethical Considerations
This review shows that most of the identified studies involved vulnerable populations, such as learners with disabilities and young learners. Few of the studies describe how the data are processed to ensure confidentiality (e.g., storing only the participant ID and the related test results [19]). Many of the studies do not explicitly mention data processing, so it is unclear how they manage all the data acquired from sensors and cameras, which include personal information (e.g., [14], [54], [60], [73]). For the acceptance and future adoption of SLEs, more awareness and debate are required not just about their legal aspects but also about the various practical and technical concerns, as well as the dispositions of the various actors (e.g., students, teachers, and parents). Such dispositions include feelings of frustration and discouragement caused by the difficulty of using a new technology or of engaging in an activity that differs from one's usual method of learning, as well as concerns about the collection and use of LA.
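As one concrete (and deliberately simplified) illustration of the confidentiality practice mentioned above, the sketch below replaces a direct identifier with a salted hash and stores only the pseudonymous ID alongside the test results. The field names and hashing scheme are assumptions for illustration; real deployments would need stronger governance (e.g., key management, consent handling, and irreversibility guarantees).

```python
import hashlib

def pseudonymize(record: dict, secret_salt: str) -> dict:
    """Replace a direct identifier with a salted hash before storage.

    Keeps only the pseudonymous ID and the analysis-relevant fields;
    the salt must be stored separately (or discarded to make the
    mapping practically irreversible).
    """
    raw_id = record["participant_name"]
    pseudo_id = hashlib.sha256((secret_salt + raw_id).encode()).hexdigest()[:12]
    return {
        "participant_id": pseudo_id,
        "test_results": record["test_results"],  # keep only what the analysis needs
    }

record = {"participant_name": "Jane Doe", "age": 9, "test_results": [0.7, 0.9]}
print(pseudonymize(record, secret_salt="per-study-random-salt"))
```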

VI. LIMITATIONS
This review has some limitations regarding both the method and the interpretation of the findings. We had to make various methodological decisions (e.g., database and search query selection) that might have affected the results. First, the search query we employed captured relevant papers that used the keyword "multisensory" or "multimodal" (which are common terms in the LA/MMLA communities), but we might have missed potentially relevant papers that used terminology such as "affective learning systems" or "sensor learner modeling" (the methodology we employed refers to such misses as Type I errors). Therefore, although our analysis covers various papers that used the terms multisensory and multimodal, we acknowledge that potentially relevant works [e.g., from communities such as AI in education (AIED) and intelligent tutoring systems (ITS)] might have been excluded.
Despite the advantages of reusability and transparency of the selected process (the search results, coding, and analysis are all available in the Appendixes), the methodological decision not to utilize an inter-rater reliability index might introduce another possible bias (hindering the reliability of the results). This "nonsystematic component" can be considered a deviation from the SLR process; nevertheless, it is an acceptable practice in HCI research [42] and in line with other accepted types of reviews, such as semisystematic, integrative, and narrative literature review processes [111]. Another significant limitation might arise from the lack of certain information in the selected papers; this missing information resulted in some gaps in the coding of the publications. However, we tried our best to eliminate such bias by following a predefined research protocol. Moreover, the emphasis was explicitly on empirical evidence, and the publications were coded by two separate researchers. Finally, missing data were minimal and had no impact on the results.
Another weakness of this research is its major emphasis on multisensory and multimodal environments and the omission of the term SLEs from the search query. Our decision not to include the term "smart learning environments" in the search query was conscious, because adding one more term produced too many Type I errors (several authors use different terminology to describe smart learning systems, e.g., intelligent, adaptive, and AI-powered) and greatly reduced our scope. Ultimately, all the MSE studies focused on how multisensory data empower learning environments to become more effective, efficient, and engaging (i.e., smarter); therefore, the inclusion of SLEs as an extra term was not needed. Hence, the selected query allowed us to examine how the features of multisensory interaction (and multimodal data) lead to a meaningful experience when integrated into SLEs.

VII. CONCLUSION
The current review describes the state-of-the-art literature at the confluence of MSEs for learning and LA by analyzing 33 relevant empirical articles published during the last decade (i.e., 2010-2020). The main aim was to gather insights into how contemporary SLEs' design and practice can benefit from multisensory capabilities. In particular, this study looked at the utilization of LA when an MSE is involved, as well as the use of specific stimuli and interactions in relation to the learning goals, the technology used, and the needs of the sample group. This review describes the current state of MSEs and the respective LA, but it also sheds light on the need for future work. In the future, we need to focus on defining guidelines and models for how SLEs can fully integrate (and benefit from) multisensory capabilities and LA to further empower learning environments and practices. Future works should provide details about the use of SLEs, with a holistic consideration of the different design factors of SLEs (see Fig. 6). Moreover, future work should focus on how SLEs can utilize the various MMLA for the purposes of understanding the learning process and optimizing the learning design. Finally, in this work, we identified and discussed the different "user roles" and the potential of incorporating the secondary user into the design and implementation of future SLEs. This also requires additional research into methodologies and techniques that enable us to involve diverse participants (both learners and secondary users) in the design of SLEs.