A Review of Emotion Recognition Methods from Keystroke, Mouse, and Touchscreen Dynamics

Emotion can be defined as a subject's organismic response to an external or internal stimulus event. The response can be reflected in pattern changes of the subject's facial expression, gesture, gait, eye movement, physiological signals, speech and voice, keystroke and mouse dynamics, etc. This suggests that, on the one hand, emotions can be measured or recognized from these responses, and on the other hand, they can be facilitated or regulated by external stimulus events, situation changes, or internal motivation changes. It is well known that emotion has a close relationship with both physical and mental health and usually affects an individual's and a team's work performance; thus, emotion recognition is an important prerequisite for emotion regulation towards better emotional states and work performance. The primary problem in emotion recognition is how to recognize a subject's emotional states easily and accurately. Currently, there is a body of good research on emotion recognition from facial expression, gesture, gait, eye-tracking, and physiological signals such as speech and voice, but these approaches are all intrusive and obtrusive to some extent. In contrast, keystroke, mouse, and touchscreen (KMT) dynamics data can be collected non-intrusively and unobtrusively as secondary data responding to primary physical actions. This paper therefore reviews the state-of-the-art research on emotion recognition from KMT dynamics and identifies key research challenges, opportunities, and a future research roadmap. In addition, this paper answers the following six research questions (RQs): (1) what are the commonly used emotion elicitation methods and databases for emotion recognition? (2) which emotions can be recognized from KMT dynamics? (3) what key features are most appropriate for recognizing different specific emotions? (4) which classification methods are most effective for specific emotions? (5) what are the application trends of emotion recognition from KMT dynamics? (6) which application contexts are of greatest concern?


I. INTRODUCTION
The term "emotion" is not uniformly defined, when asked "what is emotion", answers can be different [1]. Among them, reference [1] gave a relevant comprehensive description in which emotion is defined as "an episode of interrelated, synchronized changes in states of all or most of the five organismic subsystems in response to the evaluation of an external or internal stimulus event as relevant to major concerns of the organism". Based on this definition, emotion can be described as a multidimensional construct composed of cognitive, motivational, somatic, motoric and subjective elements [2]. This suggests that emotions can be noticed, measured, and facilitated by external stimulus events, situation changes, and internal motivation changes [1].
Measuring the changes in someone's emotional states is the first step towards emotion recognition, and to measure the changes, there is a need to establish a benchmark for classifying different emotional states. Many researchers used emotion elicitation intermediaries such as short videos and images to stimulate different emotional states and then observed subjects' reactions in those states [3][4][5]. These reaction data, together with the corresponding emotional states, were then used as labeled data to train emotion classification algorithms [6].
The reason for recognizing emotion is that emotion plays an important role in people's everyday life. People's behaviors are often influenced by their emotional states [7], and the behaviors in turn affect both individual and team performance [7]. It is widely believed that positive emotions in the workplace lead to smoother social interactions and more helping behaviors, which thus contribute to positive consequences [8,9]. In particular, while the whole world is facing the Covid-19 pandemic, the conventional face-to-face teamwork model has been disrupted, since most people must work from home. In this case, the web provides an alternative way of "working together" [10,11]. People work online, use instant messaging applications to contact their colleagues, and even discuss with them in virtual meeting rooms [11]. Nevertheless, geographic isolation may give rise to negative emotions or psychological problems that harm remote cooperation, such as poor cooperation performance and working quality [12,13]. Thus, knowing how to recognize emotion at the edge and how to intervene to help people keep a positive emotional state and concentrate on their work is important for remote teamwork.
Various measurement modalities can be used to measure emotions, such as changes in facial and vocal expression patterns, body gestures, and complex multimodalities [14]. Recognizing emotions has thus attracted much research attention across different data sources and recognition methods, which can be generally classified into three categories: a) using external body signals, including facial expression, body gestures, gait, eye-tracking, etc.; these signals can be easily noticed by others but do not always reflect one's real emotional states [15][16][17][18][19][20][21][22][23]; b) using internal physiological signals such as heart rate, pulse, skin conductance, blood pressure, electroencephalography (EEG), etc.; these signals can reflect emotions more precisely but cannot be captured as easily as external body signals [24][25][26][27][28][29][30][31]; c) utilizing contextual signals other than body signals themselves, such as voice, text content, and KMT dynamics, which can be collected non-intrusively and unobtrusively [32][33][34][35][36][37][38][39]. There have been systematic surveys or reviews on emotion recognition from facial expression [17], body gestures [18,19], eye-tracking [22], internal physiological signals [25,27], voice [32][33][34], and text [35,36], so readers interested in those topics can refer to them. This paper will not detail them. Instead, it focuses on emotion recognition from KMT dynamics, because KMT-related data can be recorded by a tool running in the background without disturbing users' normal work [37]. This unobtrusive characteristic makes it easy to measure more natural emotions in a normal work environment and suitable for scenarios, such as working environments, where data privacy is a main concern.
The dynamics information contained in KMT data can be used as biometrics, for example as performance indicators [13] and enhanced passwords [38][39][40]. It can also be used for emotion recognition [4,41,42], which is the focus of this paper. There have been some review papers [43,44,46-49] related to emotion recognition from KMT dynamics. Some [43,44,46,47] were published in 2013-2015, reflecting earlier research, while the recent paper [48] focused on advances and applications of KMT dynamics, and paper [49] only reviewed the research of the last decade. These studies summarized general information about emotion recognition procedures, features, datasets, and classification methods, but did not answer the further questions we are concerned with (see Table I). Thus, a more detailed literature review of the state-of-the-art research on emotion recognition from KMT dynamics is still lacking.
To fill this gap, this paper provides a systematic literature review on emotion recognition from KMT dynamics. Its contributions include: a) a systematic review of related literature over the past twenty years on emotion recognition from KMT dynamics to identify existing emotion elicitation methods, recognition methods, and their performance profiles; b) six research questions raised and answered for advancing research practice; c) key research gaps, challenges, and potential directions identified for future research. The remaining parts of this paper are organized as follows: section Ⅱ describes the method used to search and identify related articles; section Ⅲ provides a critical analysis of existing work from three perspectives: emotion elicitation methods, recognition features, and classification methods; section Ⅳ discusses some key applications of KMT dynamics-based emotion recognition; section Ⅴ summarizes key research gaps and challenges and discusses potential future research directions; the last section concludes this study.

II. METHODOLOGY
We used a systematic literature review method [49] to conduct this study. First, we identified six research questions (see Table Ⅰ) and used them as clues to generate literature search terms (see Fig. 1). The search string was built from related keywords or phrases and logical connectors.
Second, in order to answer these RQs, the WEB OF SCIENCE CORE COLLECTION was selected as the main data source for searching relevant literature, since it includes some of the most trusted global citation databases and provides comprehensive citation data for many academic disciplines. In addition, GOOGLE SCHOLAR served as the complementary database since it is the largest academic search engine; about 20% of the articles cited in this paper come from GOOGLE SCHOLAR.

FIGURE 1. Search string for the literature search, where "TS" means "Topic", "OR" and "AND" are logical connectors, and the others are keywords or phrases.
Third, we filtered out unrelated literature by reviewing abstracts. Using this search strategy, we found about 142 articles written in English in total. We did not set time or discipline restrictions because we wanted to include as much related work as possible. Consequently, some of these articles were not related to emotion recognition from a computer science perspective, so we further filtered out 40 articles by reading their abstracts and conclusions. As a result, we finally obtained 102 papers relevant to emotion recognition from KMT dynamics. These papers are classified in Fig. 2 based on their research focuses. The research volume, with a generally increasing trend over the past 20 years, is indicated in Fig. 3, and the trend of research on emotion recognition from single or combined KMT dynamics is illustrated in Fig. 4.

III. CRITICAL ANALYSIS
Computers are widely used in work and daily life, and the use of laptops and smartphones is also increasing significantly [41]. For example, about two billion people all over the world are now smartphone users [41]. Adult smartphone users frequently type on touchscreens every day for communicating with others, surfing the Internet, or posting on social media [49]. These interactions with computers and smartphones provide rich secondary data derived from hand motions for emotion recognition [4], making it a hot topic to study users' emotional states from KMT dynamics [49].
In order to recognize emotion, how to model emotions is a primary and essential problem. After a long evolution of research, there are mainly two approaches to describing emotions: the categorical approach and the dimensional approach [50]. The categorical approach describes emotions in a discrete way, such as the seven basic emotional states: neutral, happy, sad, angry, surprised, scared, and disgusted [5]. The dimensional approach describes emotions along three dimensions: arousal, valence, and dominance [44]. Arousal is defined as the energy of a feeling, ranging from low (sleepy) to high (excited); valence describes the level of an emotion from positive to negative; and dominance describes the feeling of control, ranging from lack of control to in control [45]. However, most studies considered only the valence and arousal parameters, because integrating dominance is not useful when the accuracy of arousal is low [44].
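To make the two description approaches concrete, the minimal Python sketch below represents the categorical approach as a discrete label set and the dimensional approach as points in a valence-arousal plane. The coordinate values and the names (VA_SPACE, to_quadrant) are our own illustrative assumptions, not values from any cited study.

```python
# Illustrative sketch of the two emotion-description approaches.
# The valence/arousal coordinates are rough illustrative placements,
# not values taken from any cited study.

# Categorical approach: a discrete label set (the seven basic states).
BASIC_EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised", "scared", "disgusted"]

# Dimensional approach: each state as (valence, arousal) in [-1, 1].
VA_SPACE = {
    "happy":   ( 0.8,  0.5),   # positive valence, moderately high arousal
    "sad":     (-0.7, -0.4),   # negative valence, low arousal
    "angry":   (-0.6,  0.8),   # negative valence, high arousal
    "neutral": ( 0.0,  0.0),
}

def to_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to a coarse quadrant label."""
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v}-valence / {a}-arousal"

print(to_quadrant(*VA_SPACE["angry"]))  # negative-valence / high-arousal
```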
In this section, we aim to answer our research questions by critically analyzing the related articles. The key information for answering these questions is systematically summarized in Tables Ⅵ and Ⅶ. In particular, Tables Ⅵ and Ⅶ provide detailed emotion recognition information based on statistical analysis and machine learning respectively. They contain multidimensional information covering types of emotions, subject samples, features, recognition techniques, and performance profiles, which helps to make comparisons across various dimensions. These two comprehensive tables, together with other specific tables, answer the concerned research questions, as detailed below.

A. EMOTION ELICITATION METHODS
Emotion elicitation is usually a process that uses intermediaries to induce certain emotional states in people [42,52,53]. It usually happens in the early stage of the whole emotion recognition experiment [52,53], for example to stimulate stress [4,43,54,55]. The aim of the elicitation is to make sure that participants are in a certain predefined emotional state before their behavior data are collected [56]. In this case, KMT dynamics data can be correlated with the given emotional states and used as labeled data (ground truth) for training classification models [57]. Otherwise, due to the complex nature of human minds, it usually takes a long time to collect enough data corresponding to various emotional states [28]. Therefore, most studies stimulated participants' emotions by using pictures [3,52], sounds [58,59], and videos [50,52,45,60]. In addition, there are also interactive elicitation methods such as social interactions [42], tasks [50], and games [77] in given environments [55], etc. Table II summarizes these elicitation methods and some existing databases in detail, which answers RQ 1. Researchers in [61][62][63][64][65][66][67][68][69][70][71][72][73][74] built or introduced databases for emotion elicitation, which are detailed in Table II. Others either applied some of these databases or generated their own similar materials for their studies. The common stimulus methods and databases for emotion elicitation are discussed below.
The first commonly used stimulus method is picture-based, including still pictures and short video clips. Picture databases consisting of sets of different kinds of pictures are easy for researchers to access and effective for emotion elicitation. For example, the researchers in [3] selected 60 pictures from the IAPS database to elicit valence and arousal states, while a set of facial images from the POFA database [53] was utilized to elicit the neutral emotional state and Ekman's six basic emotions (happy, sad, fearful, angry, surprised, and disgusted). A potential problem here is that a previous stimulus might have a long-lasting impact and influence the next elicited emotion. In order to minimize the echo from the previous elicitation, some research [60] suggests a countdown (control) mechanism to ensure an interval between two adjacent stimuli. For example, neutral images [60] selected from the GAPED and EmoMadrid databases were used to let participants' heart rate return to normal after a positive or negative elicitation.
Short videos are the most used stimuli. It has been proved in both laboratory and field environments that short videos (7- to 11-minute long) can evoke a variety of longer-lasting emotions [52]. Some researchers [52] selected short videos to elicit the neutral state and different combinations of valence and arousal levels. Others used short videos [60] to induce discrete emotional states like happy, sad, fearful, angry, surprised, lovely, hateful, exhausted, and disgusted. It is noticed that a single video can induce more than one emotion [45], so it is desirable to choose video clips that are likely to induce one definite emotion. For example, the researchers in [45] found nine 2-minute-long video clips for this purpose.
The second commonly used stimulus method is sound-based, and there are several sound databases for emotion stimuli. For example, the authors of [58] selected 63 sounds from the IADS-2 database to elicit valence and arousal emotions. In the same vein, some researchers [59] used three different styles of music (relaxation, rock, and jazz), while others [75] used heavy metal music and recordings of famous funny talks to provoke different levels of arousal.
The above induction approaches are easy to access, but compared with interactive elicitation methods [4,102], their effects are short-lived, because emotions induced by stimuli such as short videos fade away as the experiment proceeds.
Interactive elicitation methods like social interactions, tasks, and games have thus been widely used to elicit emotions. In [76], participants were asked to listen to a set of predefined stories and immerse themselves in these situations for the neutral, angry, fearful, happy, sad, and surprised states respectively. The authors of [77] believed that widespread touch-based computer games can elicit strong emotions in a short period of time. In addition, research [50] also found that different tasks, especially those with clear instructions or longer expressive writing, could better elicit specific emotions. The research in [42] introduced four text conversations with different topics to trigger the angry, happy, sad, surprised, and stressed states respectively, during which participants were led to chat about a specific topic, reminisce about their most enjoyable holiday experience, or be treated rudely.
The task-based elicitation method is particularly practical for eliciting stress. Stress is a common emotion in modern society, which may impact people negatively in both work and daily life. Researchers have usually described stress as the reaction to exterior stimuli (the environment), such as noises, injuries, coldness, and excessive demands [78]. A stress state is expected to arise when participants are given limited time to complete a task [78][79][80] with various difficulty levels [54], or are asked to do tough tasks in a noisy environment, such as with loud traffic noise [55].
Table II summarizes these emotion elicitation methods and provides some commonly used picture, sound, and video databases. The above information answers RQ 1.
However, note that the elicitation methods in Table II are mainly based on two human senses, namely sight and hearing. In fact, humans interact with the world through five senses: sight, hearing, smell, touch, and taste. Thus, theoretically, emotion could also be elicited through methods based on the other three senses, and through combinations of all five senses-based methods. If we regard these five senses as five independent information input variables (channels), each of them could have controllable effects by changing its control variables, such as environmental noise and lighting conditions, the volume of sounds, or the brightness of pictures, etc.

B. EMOTIONAL STATES THAT CAN BE IDENTIFIED FROM KMT DYNAMICS
Although some emotional states can be elicited using the above elicitation methods, it does not mean that these emotional states can be recognized from KMT dynamics. Thus, there is still a need to find out which emotions can be recognized from KMT dynamics. Based on the reviewed literature, the recognized emotions are summarized in the first columns of Tables Ⅵ and Ⅶ. It is found that discrete emotional states like happy, sad, angry, surprised, fearful, and disgusted [76] can be recognized qualitatively from KMT dynamics. A few studies distinguished different levels of arousal and valence [78]. Some only classified emotions into positive and negative ones [97], and others focused on one single emotional state such as stress, since it has been a main emotion threatening people's physical and mental health under increasing workplace competition [98].
The detailed emotions that can be recognized from KMT dynamics are shown in the first columns of Tables Ⅵ and Ⅶ, providing the answer to RQ 2.

C. EXTRACTED KEY FEATURES
Computers and mobile phones are ubiquitous in people's everyday life and work, so it is feasible and low-cost to collect adequate data from these devices [41,49]. It is also unobtrusive and non-invasive to record users' KMT events during normal use [37,39]. After collecting these event data, it is necessary to extract suitable features for emotion recognition. In this paper, the features are categorized into signal features and user features. The identified features from these two categories provide the answer to RQ 3.

1) SIGNAL FEATURES
Signal features in this context refer to KMT features, namely keystroke features, mouse (usage) features, and touchscreen (finger-touch) features.

a: KEYSTROKE FEATURES
Keyboard interactions are usually expressed in the form of key-press and key-release events [49]. Based on these two kinds of events and the associated timing information, keystroke features can be roughly grouped into timing-related (K1~K6), frequency-related (K7~K17), and other (K18~K22) features [49] (see Table Ⅲ). Timing-related features mainly include duration, latency, and typing speed [43,49]. Duration refers to the time interval between the press and release of the same key or of consecutive keys [43,49]. Latency usually indicates the time between a key-up event and the next key-down event [43,49]. Typing speed is usually defined as the total number of words/characters typed per time unit [43,49]. Frequency-related features usually measure the frequencies of specific keys, such as delete and backspace, which are often used to reflect the typing error rate [54]. For a keyboard equipped with pressure sensors, key pressure can also be collected and used as a feature to help predict emotions [38,76]. In addition, some studies [45,52,59] also considered normalized features, such as the mean or standard deviation of the number of times the enter key was used. In fact, not all these features are used in every study, since not all of them contribute significantly to predicting certain emotions [43].
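As an illustration of how the timing-related features above can be computed, the following minimal Python sketch derives duration, latency, and typing speed from a toy stream of key press/release events, together with the per-session mean/standard-deviation aggregates several studies report. The event format, sample values, and variable names are assumptions for illustration, not taken from any cited study.

```python
# Minimal sketch of the timing-related keystroke features described above
# (duration, latency, typing speed). Event format and values are assumed.
from statistics import mean, stdev

# Each event: (key, press_time_ms, release_time_ms), ordered by press time.
events = [("t", 0, 95), ("h", 140, 230), ("e", 310, 390)]

# Duration: hold time of a single key (release - press).
durations = [rel - prs for _, prs, rel in events]

# Latency: time between releasing one key and pressing the next.
latencies = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]

# Typing speed: characters per second over the whole sample.
elapsed_s = (events[-1][2] - events[0][1]) / 1000.0
speed_cps = len(events) / elapsed_s

# Normalized (aggregate) features, as several studies compute per session.
features = {
    "duration_mean": mean(durations), "duration_std": stdev(durations),
    "latency_mean": mean(latencies), "latency_std": stdev(latencies),
    "speed_cps": speed_cps,
}
print(features)
```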
The researchers in [37] extracted K1, K2, K3, K21 (2D and 3D), K7, K10, K15, and K16 for emotion recognition and calculated the mean and standard deviation of each feature. Besides keystroke features, they also used content features derived from the resultant typed text. Considering the happy and stressful states, the researchers in [82] first used just four features: K7, K8, K19, and K20. Then, in order to enhance the recognition performance, they tested more features, including K20 and the mean and standard deviation of K1 and K2. The study [83] extracted commonly used features like K1, K2, K7, K8, K13, and K21 and calculated means and standard deviations based on some of the most used Polish 2D and 3D words, obtaining a 36-dimensional feature vector for emotion recognition. The research in [51] extracted 19 features from free-text and fixed-text typing patterns, which include not only K1, K2, K3, and K6, but also computed features such as the mode, standard deviation, variance, range, min, and max of the above features. In addition, it also collected free and fixed text content. However, not all these features were used to predict emotional states; the study iterated over all combinations of the above features and finally selected the seven features (K6 and the mode and min of K1, K2, and K3) that worked best.

The following feature definitions are excerpted from Table III: input rate (K19) is the ratio of input activity during the typing procedure; pause rate (K20) is the ratio of pauses during the typing procedure; number of total key events (K21) is the count of all key events; typing amplitude (K22) is the maximum value in the range from 10 milliseconds before to 20 milliseconds after the typing instant.
Besides, some researchers [42,50,55,76] thought that emotion has an impact on typing pressure, so they collected K18 from keyboards with pressure sensors. For example, the researchers in [76] computed the pressure sequence and five other features from the typing data: the mean value, the standard deviation, the difference between the max and min, the positive energy center, and the negative energy center. They also found that two traditional keystroke features, K2 and K3, are useful.
Conversely, the researchers in [3] and [58] investigated the influence of emotion on typing patterns. In their experiments, K1, K2, and K7 were extracted to determine the influence of emotion on these features.

b: MOUSE FEATURES
The mouse is another common input device, and mouse features are also effective in emotion recognition, whether used separately or in combination with keystroke features [4,84]. Depending on the types of mouse events, mouse features can be divided into three categories: click-related features (M1~M6), movement-related features (M7~M19), and other features (M20, M21) [45,85]. The most used mouse features are summarized in Table Ⅳ.
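To illustrate how movement-related mouse features of this kind can be derived, the sketch below computes a straight-line distance (in the style of M8), a trajectory length (in the style of M10), and a movement duration (in the style of M9) from a toy mouse trajectory. The sampling format and values are illustrative assumptions, not from any cited study.

```python
# Hedged sketch of movement-related mouse features (M8/M9/M10-style).
import math

# Each sample: (timestamp_ms, x, y) from one continuous mouse movement.
trajectory = [(0, 10, 10), (16, 30, 25), (32, 60, 40), (48, 80, 80)]

# M8-style distance: straight line from the first to the last point.
(_, x0, y0), (_, x1, y1) = trajectory[0], trajectory[-1]
straight_dist = math.hypot(x1 - x0, y1 - y0)

# M10-style trajectory length: sum of segment lengths along the path.
path_len = sum(
    math.hypot(bx - ax, by - ay)
    for (_, ax, ay), (_, bx, by) in zip(trajectory, trajectory[1:])
)

# M9-style duration, and an average speed derived from it.
duration_ms = trajectory[-1][0] - trajectory[0][0]
avg_speed = path_len / (duration_ms / 1000.0)  # pixels per second

print(straight_dist, path_len, duration_ms, avg_speed)
```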
Other research using the mouse features listed in Table IV to recognize emotions is shown in Tables Ⅵ and VII.

c: TOUCHSCREEN (FINGER-STROKE) FEATURES
In recent years, as smartphones have become more and more popular, studying users' emotions from touchscreen patterns has attracted increasing research attention. When using a touchscreen, people usually move the cursor to a certain position and then type with a finger or stylus, much as they do with a traditional computer keyboard and mouse. Therefore, besides features related to the touchscreen itself (see Table V), touchscreen features also share characteristics with both keystroke and mouse features.
In order to investigate the possibility of emotion recognition from touch-based devices, the research in [77] extracted T1, T14, and K1 from mobile gaming interaction data. Sixteen finger-stroke features were then computed, consisting of the average, median, max, and min values of T8, T10, T14, and T15. The research then mapped the features onto specific emotional states through visual inspection; for example, the pressure feature showed a clearer separation between frustration and the other emotions.
Using the information gain method, the researchers in [96] identified, out of 14 features, 10 features having a relatively strong association with emotions. These included traditional keystroke features such as K6~K11, K13, K14, and K18, environmental features like location and weather, and other features like the device shake count. Similarly, the research in [89] used the information gain method to rank the importance of each feature, and also used the timestamp of each tap event and the type of key input (alphanumeric keys, delete keys, etc.). In addition, it considered two other features: a working-hour indicator and persistent emotion.
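As an illustration of this kind of feature ranking, the sketch below uses scikit-learn's mutual information estimator (a common stand-in for information gain) to rank synthetic placeholder features; the data and feature names are our own assumptions, not those of [89] or [96].

```python
# Sketch of information-gain-style feature ranking with synthetic data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 200
# Placeholder feature matrix: e.g., latency mean, error-key rate, shake count.
X = rng.normal(size=(n, 3))
# Synthetic labels loosely tied to the first feature so it ranks highest.
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

scores = mutual_info_classif(X, y, random_state=0)
names = ["latency_mean", "error_key_rate", "shake_count"]
for name, score in sorted(zip(names, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```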
The BiAffect research project [41] collected data from participants using supplied mobile phones whose standard keyboard was replaced by a custom keyboard. The collected data were then used to extract the K1, K5, K7, K13, M18, T5, T6, and T7 features. In addition, the project investigated how the typing dynamics varied over time, against each hour of the day and each day of the week.
Different from the above research, the study in [42] did not consider duration features, simply regarding them as part of a single input; as a result, touch events were aggregated and transformed into two-dimensional heat maps for use.

The following feature definitions are excerpted from Table IV: number of mouse movements (M7) is the number of mouse movements in a defined time interval; distance of mouse movement (M8) is the straight-line distance between the beginning and the end of each movement; duration of mouse movements (M9) is the time of one mouse-movement event; length of mouse racing line (M10) is the length of each movement trajectory; length of pauses in mouse movement (M11) is the length of the pauses between mouse movements (a pause is defined variously by different researchers); number of pauses in mouse movement (M12) is the number of pauses between mouse movements; off-click movement (M13) is mouse movement with no mouse button pressed; on-click movement (M14) is mouse movement with a mouse button pressed; mouse inactivity duration (M15) is the total time of mouse pauses.

The study [4] also utilized keystroke features and mouse features in two separate experiments. The keystroke features K1, K2, and K7 and the mouse features M16 and M19 were used, logged at a constant distance rather than a constant time interval (for example, every 10-pixel straight-line segment). In contrast, the study [45] logged 17 attributes, such as M1, M2, M3, M8, M20, K1, K2, K7, K8, K11, and K14, in every 5-second interval.
With the development of smart mobile devices, a good amount of research has combined touchscreen features with keystroke and mouse features in applications. The research in [90] used K1, K2, and K3 from keystroke dynamics, M8 and M10 from mouse dynamics, and T1 and T10 with corresponding times. In addition, other computed features were also considered, such as the average, min, max, standard deviation, and variance of the first and second derivatives of the above features.
More features used in the literature are shown in column 3 of Tables Ⅵ and Ⅶ. To answer RQ 3: K1, K2, K7, and K18, M2, M8, M16, and M19, and T10 and T14 are found to be the most frequently used features. This may reflect, to some degree, that these features are more effective than others. Although some research ranked the importance of the features used, the reasons why some features outperform others, and which feature groups work well for specific emotions, are still not clear. Thus, more work is needed to better answer these questions.

2) USER FEATURES
User features refer to those related to individual differences, which can cause different patterns of keyboard, mouse, and touchscreen use for different people [6,43,56]. Features like age, gender, cultural context, computer proficiency, and habits like left- or right-handedness, as well as the body posture when users are typing, can be regarded as user features [13,41].
Many applications neglect user features, especially when they collect data in specific scenarios like education [92,94], teamwork [13,95], and health care [41,82,93]. However, in real-world settings, users have a variety of cultural backgrounds, different proficiency with input devices, and different handedness, which can affect the recognition result. For example, a person who uses a computer every day usually has a high typing speed, and conversely, a person who seldom uses a computer types slowly. In this case, if user features are ignored, researchers may wrongly conclude that the former is in a positive emotional state and the latter in a negative one [56]. Thus, user features are also important for making recognition results more credible.
Only a few studies have considered user features. In [78], participants were first requested to provide information such as age, gender, experience, and education level. These features were then used as control variables in different models to reflect their influence on the results.
Some researchers [55] did not utilize user features directly in their experiments, but restricted the choice of participants to limit the influence of individual user features. The restrictions included right-handedness, no color blindness, no tremors, no medication for hypertension or any other cardiovascular disease, and no history of musculoskeletal disorders. These restrictions ensure that participants have similar profiles and largely avoid individual differences. The study [53] focused on emotion detection for the elderly based on smartphone keystroke dynamics, which has the potential to monitor pathological cases among elderly people. As a high percentage of older people did not use smartphones, participants were required to meet criteria regarding education level, experience in using a smartphone, non-existence of typing difficulties and vision problems, and non-existence of mental disorders. In addition, participants were requested to hold the smartphone in one hand, which aimed to eliminate the influence of posture on keystroke patterns.
User features have been considered a factor that can influence the recognition result, but most research neither includes this kind of feature nor investigates which user features influence recognition performance and how. Thus, we cannot yet conclude whether user features contribute to emotion recognition.
To answer RQ 3, the main signal features are summarized in the third column of Tables III, IV, and Ⅴ; these are the most frequently used features. Other minor features are listed in column 3 of Tables Ⅵ and Ⅶ.

D. RECOGNITION METHODS
The techniques for emotion recognition from KMT dynamics can be divided into two classes, namely statistical methods and machine learning methods (including neural networks). This subsection aims to answer RQ 4.

1) STATISTICAL METHODS
Statistics is a widely used data analysis approach, and in the early years many researchers used statistical methods to recognize emotional states. These methods differ mainly in the threshold values they define to distinguish different emotions.
SPSS analysis [78] was applied to compute the P-value and F-value between variables to evaluate the correlation between keystroke dynamics and emotions. Similarly, correlations between the learner's behavior B, mouse behavior B(M), and keystroke behavior B(K) were examined using Pearson correlation tests [79], showing that there is a great possibility of developing a cost-effective system to sense learners' emotions from KMT dynamics. By computing the Wilcoxon signed rank and Spearman's rank-order correlation [50], it was found that the stressful condition is associated with self-reported stress, and that there is a positive correlation between self-reported stress and typing pressure. The most used effective statistical methods are summarized in the fourth column of Table Ⅵ along with other information such as emotions, subject samples, and features, which makes them more integrated and easier to compare.
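For illustration, the following sketch applies the tests named above (Pearson and Spearman correlations and the Wilcoxon signed rank) to synthetic keystroke-style measurements using SciPy; the numbers are placeholders, not data from the cited studies.

```python
# Minimal sketch of the statistical tests used in these studies, on toy data.
from scipy.stats import pearsonr, spearmanr, wilcoxon

typing_speed = [42, 38, 51, 45, 36, 48, 40, 44]  # chars/min per session
self_report  = [ 3,  2,  5,  4,  2,  4,  3,  4]  # self-reported valence
stress_speed = [35, 33, 44, 40, 30, 42, 36, 39]  # same subjects, stressed

r, p = pearsonr(typing_speed, self_report)       # linear correlation
rho, p_s = spearmanr(typing_speed, self_report)  # rank-order correlation
# Paired comparison of the same subjects under neutral vs. stressed conditions.
w, p_w = wilcoxon(typing_speed, stress_speed)

print(f"Pearson r={r:.2f} (p={p:.3f}), Spearman rho={rho:.2f} (p={p_s:.3f})")
print(f"Wilcoxon W={w:.1f} (p={p_w:.3f})")
```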

2) MACHINE LEARNING METHODS
As computer and AI technologies advance, more and more research focuses on machine learning methods (including neural networks) for emotion recognition.
Many researchers tested only one machine learning algorithm to classify emotions. For example, the researchers in [37] aggregated the data and used only the Decision Tree (DT) algorithm to identify emotions, since DT is a simple and low-cost solution. However, they found that participants' responses were not distributed equally across all levels of each emotion. To address this data skew, they applied an undersampling method. After classification, 10-fold cross-validation and the Kappa statistic were used to assess the model. The study [96] adopted a machine learning approach to analyze data using Weka. The extracted features were fed to a Bayesian Network classifier, since this method performed best among machine learning classifiers such as Naïve Bayes, DT, and Neural Networks in repeated 10-fold cross-validation experiments. The SVM method was also used [90] to recognize emotion with KMT features separately; after recognition, three evaluation criteria, namely recognition accuracy, false positive alarm rate, and computational time, were applied to evaluate the algorithm's performance. The study [82] predicted the happy and stressed states using a K-Nearest Neighbors (KNN) classifier in Weka.
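The following sketch reconstructs, under our own assumptions and with synthetic data, the general pipeline pattern described for [37]: undersampling the majority class, training a decision tree, and assessing it with 10-fold cross-validation and the Kappa statistic. It is illustrative only, not the cited authors' code.

```python
# Hedged reconstruction of the pattern in [37]: undersampling + decision tree
# + 10-fold cross-validation + Cohen's kappa, on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import cohen_kappa_score, accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                                  # feature vectors
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0.4).astype(int)   # skewed labels

# Simple random undersampling of the majority class.
idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
keep = min(len(idx0), len(idx1))
idx = np.concatenate([rng.choice(idx0, keep, replace=False),
                      rng.choice(idx1, keep, replace=False)])
Xb, yb = X[idx], y[idx]

# 10-fold cross-validated predictions, then accuracy and kappa.
pred = cross_val_predict(DecisionTreeClassifier(random_state=0), Xb, yb, cv=10)
print("accuracy:", accuracy_score(yb, pred))
print("kappa:   ", cohen_kappa_score(yb, pred))
```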
Some researchers tested more than one machine learning algorithm. For example, the researchers in [77] implemented three learning algorithms, Discriminant Analysis (DA), Artificial Neural Network (ANN) with back propagation, and Support Vector Machine (SVM), to build a system for automatically discriminating four emotional states (excited, relaxed, frustrated, bored), two levels of arousal, and two levels of valence. Others applied different learning methods to different features in combination. In [51], both keystroke dynamics and text patterns were used: the keystroke dynamics features were input into the Weka software to classify emotions, while for the text patterns the Vector Space Model (VSM) was applied to the ISEAR dataset as the text-pattern classifier. Similarly, in [76], three learning methods were applied to global features of pressure sequences, dynamic time warping, and traditional keystroke dynamics respectively.
The study [83] trained different classifiers such as DT, Neural Networks, KNN, Naïve Bayes, AdaBoost, Rotation Forest (RF), and Bayesian Networks for emotion recognition, and found that no single classifier is good enough for identifying all predefined emotions across all participants. Conversely, building an individual emotion classification model for each participant and each emotional state was not feasible, since the data were usually insufficient to train such models. In addition, the research observed that keystroke rhythm was influenced not only by emotions but also by the typing devices. These observations may provide reasons for using the combination approach.
In [89], three models were tested using 10-fold cross-validation: L2-regularized Logistic Regression (LR), SVM with a Radial Basis Function kernel, and RF. In the tests, the RF model generated the best classification. The study [4] first labeled the keyboard and mouse features in the arousal and valence dimensions, and then tested five machine learning algorithms: LR, SVM, Nearest Neighbors, C4.5, and RF. Similarly, the study [45] exploited Bounded K-means Clustering, the KNN method, and Weka tools to classify ten discrete emotions. Different from the above, some research attempted to combine and fuse signal features before feeding them into an emotion classifier. For example, the keypress features and accelerometer features were fused early [41] before being fed into machine learning models; the reason for early fusion was that extra information may be found in the process of aligning features. Two feature fusion methods, EF-dropna and EF-fillna, were applied depending on whether missing accelerometer values were dropped or filled. Differing from other research, this study used CNNs and RNNs in combination, since the kernels in CNNs can only capture local features while the training time of RNNs is long; combining the two methods and taking advantage of each could give a better result. The researchers in [42] proposed a semi-supervised classification pipeline for predicting affective states based on touch data from smartphones; the classification network consisted of fully connected layers and was trained using the labeled data.
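As an illustration of the two early-fusion variants described for [41], the sketch below aligns a keypress stream and an accelerometer stream on timestamps and then either drops rows with missing accelerometer values (EF-dropna style) or imputes them (EF-fillna style). The column names, fill rule, and data are assumptions for illustration, not the cited study's implementation.

```python
# Sketch of early fusion of keypress and accelerometer streams, with the
# dropna/fillna variants applied to unaligned rows. All values are synthetic.
import pandas as pd

keys = pd.DataFrame({"t": [0, 100, 250], "hold_ms": [90, 85, 110]})
accel = pd.DataFrame({"t": [0, 100], "accel_mag": [1.02, 0.98]})

# Early fusion: merge both modalities into one feature table on timestamp.
fused = keys.merge(accel, on="t", how="left")

ef_dropna = fused.dropna()                                          # discard unaligned rows
ef_fillna = fused.fillna({"accel_mag": fused["accel_mag"].mean()})  # impute missing values

print(ef_dropna)
print(ef_fillna)
```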
The most used machine learning methods are summarized in the fourth column of Table Ⅶ. To summarize, statistical methods are usually used for qualitative analysis, while machine learning methods support quantitative analysis through their classification accuracies. As shown in column 4 of Table Ⅵ, the statistical methods ANOVA, Spearman correlation, and the Wilcoxon signed rank were the most popular; through these methods, the connection between emotions and KMT dynamics was studied, providing experimental evidence for future work. From column 4 of Table Ⅶ, the RF, KNN, and SVM algorithms were the most commonly used. Most research used accuracy as the criterion for evaluating recognition performance, and the reported accuracies varied from about 50% to more than 90%. However, the sizes of the subject samples and the emotions to be recognized often differ, so it is hard to say which algorithms are more effective than others.
Obviously, there is a trend of more and more research applying machine learning methods. However, some other advanced deep learning techniques have performed very well in other fields, such as the Generative Adversarial Network (GAN), Deep Belief Network (DBN), and Radial Basis Function Network (RBFN); whether these models can perform well in emotion recognition still needs future research.

The following rows are excerpted from Table Ⅵ (statistical methods and representative findings):
(a) Linear regression, ANOVA, correlation analysis, and coefficient of determination: (1) the excited state is negatively correlated with the remaining task time; (2) a significant difference in valence can be observed between the groups; (3) negative emotion is negatively correlated with typing speed, and positively correlated with typing error rate and pressure.
(b) ANOVA, Spearman correlation, and Pearson correlation: (1) direct instruction and external stimuli have significant impacts on the learner's motivation and affective state; (2) there are significant correlations between direct instruction, external stimuli, affective state, and cognitive state; (3) there are significant correlations between mouse behavior and keystroke behavior.
(c) Wilcoxon signed rank and Spearman rank-order correlation [50]: the stressed condition was associated with significant increases in self-reported stress.
(d) T-test on K1 and K2: the results prove the significance of the differences in typing patterns under positive and negative emotions for all subjects.
(e) Stressed [98]; 20 students from the University of Colombo School of Computing; features: K1 (selected 2D digraphs: th, he, in, er, an) and K1 (selected 3D trigraphs: the, and, ing); Wilcoxon signed rank: significant differences between stress and non-stress at the 90% confidence level.

IV. APPLICABILITY EVALUATION
Emotion can influence one's behavior and reflect one's health state to some extent. Thus, an intelligent system capable of recognizing and intervening in one's emotions can perform better in areas where emotion plays a crucial role, such as intelligent tutoring systems, healthcare scenarios, design and teamwork, intelligent toys, lie detection, and customer service [3,51,76]. To build such affective intelligent systems, awareness of the individual's emotion is a basic requirement, and measuring a system user's emotion without causing inconvenience to the user is preferred. With this concern, the researchers in [90,96] proposed recognizing emotions from KMT dynamics. The study [42] likewise noted that knowing the user's emotional states enables mobile phones to support more intelligent interactions, making the devices react and interact with users in a more appropriate, natural, and friendly way [76]. Several application areas require emotion recognition research with KMT dynamics (see Table Ⅷ). Based on Table Ⅷ, we discuss some popular applications in detail below and answer RQ 5 and RQ 6.

A. HEALTHCARE
In the healthcare area, emotion recognition from KMT dynamics has been applied to detecting Parkinson's disease and mood disorders. By predicting emotional states based on KMT dynamics, the research in [82] showed the potential of exploiting biometric data in the healthcare-oriented domain. In [53], the idea of early identification of Parkinson's disease through behavioral data from smartphones was tested; both emotional and physical states were analyzed for monitoring ageing pathology cases like Parkinson's disease, based on the i-PROGNOSIS project. The research in [104] also showed that using keystrokes under emotional stimuli is an effective and intelligent way to detect the onset of Parkinson's disease. For predicting mood disorders through emotion prediction based on keyboard data, the research in [41] showed that it is feasible and effective to predict the presence and severity of mood disturbance, pointing to the potential for medical treatment of mood disorders.

B. EDUCATION
It has been proved that emotion has a significant effect on learning outcomes [105]. The keyboard and mouse are the most common input devices in learning contexts, and collecting data from these devices is widely considered non-intrusive, so using KMT dynamics features to investigate learners' emotional states in educational scenarios has been a simple and effective approach [87,92,94]. In this field, the research in [54] proposed recognizing the emotions of learners studying English as a second language in order to improve learning performance. The study [79] also believed that learners' performance can be affected by their emotions. Thus, it is important to build a modern intelligent tutoring system (ITS) capable of recognizing learners' emotional states and adjusting the learning content accordingly, and emotion recognition from KMT dynamics can make such ITSs lower-cost and easier to use.

C. TEAMWORK
As teamwork has become a popular work model in modern society, it is important to study the impact of emotions on teamwork. Researchers increasingly realize that emotions are not only intrinsic to human experience but also inherent to the situations where people interact with others or the environment [2,7,9,12]. Thus, studying emotions in teamwork settings has become a promising area [8]. Erik Cambria [106] considered that recognizing emotions can improve communications among human workers, which is important for defining products and services. Specifically, the researchers in [12,13] focused on design teams and attempted to study the relationship between interaction dynamics and the performance of engineering design teams. As a result, they showed that emotions are critical for both successful romantic relationships and high-performance teamwork.

D. USER AUTHENTICATION
In fact, keystroke dynamics has been studied, widely used, and rapidly developed as a natural and complementary method in user authentication over the last 50 years [48,107-109]. This is very practical because it has been proved that each user has a unique style when typing on a keyboard [109]. More specifically, the time interval between successive keystrokes, keystroke duration, key pressure, and other keystroke factors can be used to build a unique signature for a user [108]. The study [38] proposed a biometric verification method based on pressure-sensor-equipped keyboards, in which pressure sequences and other traditional keystroke features were used respectively to reinforce the usual password authentication. By conducting two experiments, the research in [110] found that both keystroke and mouse dynamics features can measure some personality traits. In order to continuously authenticate users after login, the study [48] proposed a user-adaptive feature extraction method to collect keystroke dynamics from freely typed text. Furthermore, with the increasing spread of touchscreens on mobile phones, the study [40] focused on using touchscreen data for this purpose; it performed a large-scale study showing that a specific user can be identified among 5 users with relatively high precision, concluding that touchscreen input patterns are distinctive for each person. Also based on mobile touchscreens, the study [39] identified users almost immediately from the way they interacted, while maintaining users' convenience.
In this section, Table Ⅷ and the detailed analysis above answered RQs 5 and 6.

Ⅴ. GAPS AND FUTURE WORK
Through this systematic analysis of related work on emotion recognition based on KMT dynamics, some research gaps are identified. In this section, we discuss the following key research gaps and potential future work.

A. BUILDING A COMMON DATASET
Many researchers have investigated emotion recognition from KMT dynamics, and some of them reached high accuracy, but it is meaningless to compare them since there is no common dataset serving as a benchmark [49,93]. This need was identified in 2009 [93] but has still not been met [49]. The datasets used in the literature are usually built by individual research groups, and the features contained in different datasets are either diverse or organized in different formats [104], making them difficult to use as benchmarks. In addition, the recognized emotional states are also very diverse; for example, some researchers cared about happiness and sadness [76], while others studied arousal and valence [78]. Thus, it is necessary to build a common dataset in the future for comparing different algorithms. As for which features should be contained in such a dataset, further work is needed to identify which features play a more significant role in KMT dynamics-based emotion recognition.

B. CONSIDERING USER FEATURES
User features such as age, gender, cultural context, computer proficiency, habits like left-handedness, and even body posture may also have an impact on typing behavior [6,78]. Ignoring these features may result in wrong classifications. However, most existing studies did not consider user features when building their own datasets. Besides, other factors originating from the devices (e.g., the keyboard layout and the distance between keyboard and hands) may also change participants' typing habits. Thus, using these elements as a supplement to KMT features may improve recognition accuracy. In future work, more studies are needed to explore how to incorporate user features into emotion recognition.

C. ASSESSING RECOGNITION PERFORMANCE USING BIOLOGICAL SIGNALS
Almost all the above research relies on self-reporting for the assessment of emotion recognition. In these studies, participants were requested to record their emotional states in a self-defined questionnaire or the Self-Assessment Manikin (SAM) questionnaire [92]. This method forces participants to interrupt their typing behavior, which defeats the purpose of using keystroke dynamics as a non-intrusive method for recognizing emotions. On the other hand, it is hard to record participants' every emotional state accurately, especially when the limited questionnaire options do not cover their subtle emotions, or when they are not sure how to describe their emotions [6]. Biological signals are much more objective and precise in reflecting people's emotional states [6,111,112]. For example, for those who do not like to show their emotions, recognizing emotions through appearance or behavioral habits can be deceptive, whereas biological features like EEG signals can still reflect real emotional changes. Therefore, in the future, using biological signals as ground truth in parallel with KMT dynamics is a promising method to improve training data quality and help researchers evaluate algorithm performance more objectively and accurately.

D. IMPROVING PRIVACY AND SAFETY THROUGH FEDERATED LEARNING
Another challenge is how to improve data safety and protect user privacy. At present, almost all research collects rich data, including personal information, and then sends those data to a server or cloud storage; this transmission increases the risk of private information being lost or stolen. Thus, ensuring the security of these private data is a primary problem that every researcher should consider, yet it is inadequately addressed in most past research. With the development of hardware, computers, mobile phones, and wearable devices have growing computational power, so it is increasingly attractive to store and process emotion-related data locally [113]. Federated learning is a machine learning setting that aims to train a model when the training data are stored on different remote clients. Under the federated learning concept, the model is updated by aggregating the updates from each device [114]. In future work, applying federated learning to affective computing will be an innovative approach to privacy protection.
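To make the idea concrete, the sketch below implements a bare-bones federated averaging (FedAvg) loop in which each client trains a simple logistic-regression model on its private KMT-style data and only the weights are aggregated by the server. The model, data, and hyperparameters are illustrative assumptions, not from the cited work.

```python
# Conceptual sketch of federated averaging (FedAvg): raw KMT data stays on
# each client; only model weights travel. Everything here is illustrative.
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """Train a logistic-regression weight vector on one client's local data."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
        w = w - lr * X.T @ (p - y) / len(y)     # gradient descent step
    return w

rng = np.random.default_rng(2)
clients = []  # each client holds private (features, labels) locally
for _ in range(3):
    X = rng.normal(size=(50, 4))
    y = (X[:, 0] > 0).astype(float)
    clients.append((X, y))

w_global = np.zeros(4)
for rnd in range(10):                           # communication rounds
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    # Server aggregation: average the client weights (equal-sized shards here).
    w_global = np.mean(local_ws, axis=0)

print("global weights:", np.round(w_global, 2))
```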

E. RECOGNIZING EMOTIONS THROUGH DIGITAL WRITING
It is a trend that touchscreens in smart devices support finger-writing, so writing will no longer be restricted to paper and the computer keyboard. Moreover, the keyboard layout is designed for inputting English letters; people whose native language is not English may find it hard to use, and writing on a touchscreen may be much easier for them than typing on a normal keyboard. In the future, writing on a touchscreen may therefore give users an experience similar to writing on paper with a pen or brush [115,116]. For example, combining traditional calligraphy with modern digital art provides a new way to create, present, and preserve calligraphy works. It is also a hot topic to recognize emotions from handwriting on touchscreens [81,115]; for example, the research in [117] proposed an interactive system that can reflect calligraphers' emotions while they are writing. Up to now, little research has combined KMT dynamics with calligraphy for emotion recognition. However, in reality some people are good at writing in their native language but not at inputting it through keyboard typing (for example, writing Chinese characters by hand rather than typing letters on a keyboard). In this case, changing the typing input mode into a writing mode can be a convenient solution for many smartphone users. Thus, learning users' emotions while they are writing on a touchscreen with a stylus or finger, rather than typing on a touchscreen keyboard, is a potential research direction.

Ⅵ. CONCLUSIONS
In this paper, we have systematically reviewed emotion recognition based on KMT dynamics along the dimensions of emotion elicitation, features, recognition methods, and applications. Along these dimensions, we have answered six research questions that most researchers want to ask, and identified key application fields, current research gaps, and future research directions.