Clustering-Based Emotion Recognition Micro-Service Cloud Framework for Mobile Computing

As life becomes more stressful and challenging, people pay more attention to their mental state. Emotional states are external reactions to mental states, so researchers often assess people's mental state by monitoring their emotions in real time. At the same time, thanks to the sensors built into smartphones, applications that identify the real-time emotions of mobile users are constantly emerging. However, relative to the demands of most emotion recognition algorithms, the computing resources and battery life of mobile phones are limited, which makes the accuracy and latency of these applications unsatisfactory. In this paper, we propose a high-performance micro-service platform for mobile emotion recognition application developers (MSPMERAD). First, a classifier fusion emotion recognition algorithm is proposed that uses a dynamic adaptive fusion strategy. Second, this new algorithm is encapsulated into a micro-service. Together with affiliated micro-services such as data uploading and pre-processing, it lets developers ignore the implementation of the emotion recognition algorithm and focus on collecting sensor data and interacting with users. The accuracy and latency of an application based on MSPMERAD are compared with those of an application that runs the emotion recognition algorithm locally. Experiments on the daily behavior data of 50 student volunteers show that the application based on our platform achieves higher recognition accuracy in a more reasonable time.


I. INTRODUCTION
With the development of modern society, people's material wealth keeps growing. People can draw on many material resources in their daily life and work, which frees them from worrying about basic needs such as food and clothing. However, many factors make life more challenging and stressful than ever before. These factors come from all aspects of life, such as tests in the education system, requests from friends, family relationships, and countless tasks at work. Constant pressure has a far-reaching influence on people's mental health. Nowadays, people are becoming more aware of the fundamental connection between their physical health and their mental health, and they increasingly value their mental health. Because emotion regulation is an external reaction to a person's mental state, researchers regard it as one of the key features of mental health [16]. Therefore, recognizing people's emotional status in time is very important both for the medical assessment of mental disorders and for timely mood regulation. Emotion recognition is known as a promising technology and has been widely used in mental health monitoring over the previous decades [29], [30], [35].

The associate editor coordinating the review of this manuscript and approving it for publication was Honghao Gao.
Because of its great commercial and academic potential in the field of artificial intelligence, emotion recognition has also become an important research topic [22]. Emotion detection based on AI algorithms had become a $20bn industry by 2019 [5]. Scientists typically identify human emotional status from physiological signals such as electromyographic signals [31], skin temperature [15], facial expressions [23], and blood volume pulse [34]. However, these methods are very difficult to apply in practice because the signals cannot be obtained without special equipment and bio-sensors. Doctors often want to monitor how their patients' emotions change, because these changes can help them make decisions and judgments. For instance, patients may be required to carry special equipment because their doctors want to observe their long-term emotional status, which helps the doctors find the connection between the patients' emotional status and certain clinical symptoms. Wearing this equipment is inconvenient, and it can sometimes make patients depressed. It is therefore very important to record and recognize, in a timely manner, the emotions of observed persons who are not informed ahead of time.
In recent years, mobile phones have replaced computers as the main entrance to the Internet [33]. At the same time, mobile phones are more affordable and more powerful than ever before, and the time people spend on them is increasing rapidly. Moreover, sensors for rotation (gyroscope), temperature, light, gravity, etc. are built not only into high-end smartphones but also into low-end and mid-range ones. These sensors can record many user behaviors, such as the working speed, the sound volume, and the pressure on the phone's touch panel when a finger presses on it. Many types of emotional status can be extracted from these behaviors. The popularity of sensors in mobile phones enables doctors to record and recognize the emotional status of their patients through their mobile phones without informing the patients [13].
A good emotional state is very important for good physical and mental health. Therefore, a technology that can identify the external features of a person's emotional state and recognize the corresponding state in a timely manner may bring enormous benefits to society. We focus our research on these two issues. We present a new method for recognizing the emotional status of mobile users, based on the analysis of physiological and ecological features collected by mobile phone sensors. First, 50 volunteers are selected from Xidian University. Their daily behavior data are collected through our mobile application (EAmobile), which is installed on the volunteers' mobile phones. Second, a classifier pool that includes most machine learning based classification algorithms is assembled, taking both complementarity and diversity into account. Third, classifiers that are highly inconsistent with one another are picked and combined using a dynamic weighted fusion strategy. The class-conditional probabilities and prior probabilities of the base classifiers are used to dynamically decide their fusion weights. Finally, the classification results are mapped to the emotional status of mobile users.
However, the computing resources and battery capacity of mobile phones are limited [11], which does not meet the need of most classifiers for fast execution.
To address this issue, researchers have attempted to offload computing tasks to remote cloud platforms [12]. In our system, however, the classifiers used in the final fusion strategy change dynamically, and new classifiers are continuously added to the classifier pool during use. To ensure scalability, we adopt a micro-service architecture to implement our system. The fusion framework and all classifiers are encapsulated as individual micro-services. As a result, mobile phones can focus on collecting data and displaying results and leave the resource-intensive recognition process to the cloud. We implemented the system using both a micro-service architecture and a fully local computing architecture. Experiments on the daily behavior data of 50 student volunteers show that the application based on the micro-service architecture achieves higher recognition accuracy in a more reasonable time. We provide the following main contributions:
• We study the relation between mobile phone sensor data and emotional status. First, inspired by the circumplex model of affect [27], we describe emotional status from two aspects: pleasure degree and activity degree. Second, the real-time emotional coordinates of mobile users are computed by using the classifier fusion method.
• We study the classifier selection strategy and the fusion strategy. Both accuracy and difference factors are considered to ensure that the best classifiers for the current data set are always selected as base classifiers. The fusion weight of each base classifier is set dynamically by using its historical decisions and class-conditional probability.
• A common micro-service based emotion recognition framework is proposed to ease the implementation of similar applications. All functions are encapsulated as individual micro-services, which makes them loosely coupled. Developers can change the implementation of a single function without re-developing other functions.
The rest of this paper is organized as follows. In Section II, we review some theories and tools for solving emotion recognition problems. In Section III, we discuss the main methodology of the fusion method. In Section IV, we construct an emotion recognition micro-service framework. In Section V, we perform experiments to verify the accuracy of our new method and the efficiency of our framework. Finally, we give the conclusion in Section VI.

II. RELATED WORK
In the field of emotion recognition, there are three common categories: emotion recognition based on text analysis, emotion recognition based on verbal and facial expressions, and emotion recognition based on physiological signal analysis. We review these three categories in turn in this section. For emotion recognition based on text analysis, two main types of methods have been studied: semantic analysis based methods and word classification based methods. The former uses a semantic network to identify the emotional status corresponding to a given text. The richer the semantic knowledge base is, the more effective this method is. In paper [2], SentiWordNet is presented for opinion mining and sentiment classification. Cambria et al. introduce a semantic network with common-sense knowledge, based on natural language, into public knowledge classification [7]. Sentiment analysis can be achieved through the multi-dimensional expansion of the semantic network. The latter implements emotion recognition through a special dictionary that maps emotionally salient words to emotional scores. The emotional state implied in a paragraph can be calculated by fetching the emotional score of each word from this dictionary. Obviously, the effectiveness of this method depends mainly on the richness of the emotionally salient words in the dictionary. Thus, many researchers try to design methods for recognizing and scoring emotionally salient words [8], [9], [20].
The analysis of speech, text, and images is becoming more and more effective because of the rapid development of deep neural networks (DNNs). Thus, emotion recognition based on the analysis of facial and verbal expressions has become an important issue. People can identify the emotions of others from their facial muscle movements and the corresponding facial expressions, because different emotions relate to different expressions. For instance, a person can be judged to be happy when wrinkles radiate from the corners of his eyes and the corners of his mouth curve up. In contrast, a person can be judged to be angry when he frowns and his eyes bulge. Both overall features and local features of the observed person's facial expression can be used to recognize his emotion. The former considers features of the entire face, because people have different overall facial features when they have different emotions. The latter is grounded on the fact that the relative positions, sizes, and shapes of people's facial features vary in different situations. Many achievements have been made in this field. For instance, Affectiva collected a large number of pictures of facial expressions from people of various ages, genders, and races to create a facial expression data set [26]. Based on this data set, an artificial intelligence algorithm is designed to recognize human face textures and wrinkles, changes in the shapes of facial features, and so on.
Many people do not want others to know their thoughts and intentions, so they try to appear emotionless in their normal life, as if wearing a poker face. This reduces the effectiveness of emotion recognition technology that is based on the analysis of verbal and facial expressions. Therefore, physiological signals are used to identify people's emotions [4], [14], [19], [21], [25]. Wang et al. study various emotional characteristics of EEG by tracking the changes of EEG signals. Emotions are finally classified by establishing the connection between emotional status and EEG characteristics [32]. Liu et al. combine EEG signals with a clip database to construct a real-time emotion recognition system [24]. Machine learning based emotion recognition algorithms are effective at extracting signals with high-frequency and spatial characteristics, and good results can be achieved when machine learning is used to fuse multimodal information such as EEG signals and other physiological features [3], [18].
Nowadays, more and more researchers are working on obtaining people's emotional states through mobile phones. M. Shamim et al. use the embedded cameras of smartphones to capture facial video of users [28]. The most dominant bins of the concatenated histograms are extracted from representative frames by Kruskal-Wallis feature selection. The emotion is finally classified by a Gaussian mixture model based classifier, which can guarantee high recognition accuracy in a reasonable time. In this work, the speed of emotion recognition is improved by reducing the computational complexity of the algorithm. However, this inevitably decreases accuracy. Humaid Alshamsi et al. try to solve this problem by introducing cloud computing [1]. Facial video and speech emotion signals collected by mobile phones are transferred to a back-end cloud computing platform. They use Mel-frequency cepstral coefficients and facial landmarks to pre-process the signals, and a support vector machine (SVM) based algorithm is used to classify the emotion. This framework reduces the computational pressure on mobile phones by offloading the complex emotion recognition process to the cloud platform. However, the amount of data describing facial video and voice emotion signals is usually very large, and the transmission of these data between mobile phones and the cloud is time-consuming and energy-intensive [12]. Therefore, emotion recognition algorithms based on mobile sensors that produce a small amount of binary, non-audio and non-video data should be considered. These sensors include gravity, gyroscope, pressure, temperature, and light sensors. They can perceive a user's actions with a small amount of data, such as a single value.
As a cloud-based application is promoted, its user base grows bigger and bigger. As its code gets more and more complicated, coordination among different development teams becomes more and more difficult. Recently, the micro-service architecture has been gaining popularity due to its granular approach and loosely coupled services [28]. In a micro-service architecture, an application consists of many independent services that are responsible for different functions. In businesses across industries, from telecommunications and retail to financial services and manufacturing, many major companies such as Amazon, Netflix, and Uber are choosing the micro-service architecture to design their new applications. Researchers are trying to introduce the micro-service architecture into many different areas [17]. Svetoslav Zhelev et al. use micro-services and an event-driven architecture to implement big data stream processing [36]. Alexandr Krylovskiy et al. apply the micro-service architecture style to design a Smart City IoT platform [20]. This method can increase the energy efficiency of a city at the district level.

VOLUME 8, 2020

III. MAIN METHODOLOGY
In this section, we discuss the main idea of the proposed classifier fusion method. First, we describe the emotional model, which is the basis for measuring the emotional status of mobile users. Then, we explain how to eliminate useless data and extract meaningful features from the daily data collected by mobile phone sensors, followed by the details of the proposed dynamic classifier fusion method.

A. EMOTIONAL MODEL
Considering that discrete variables are not suitable for linear analysis and transformation, we propose a new emotional model inspired by [27]. Although people may follow different paths to well-being, they share some common external characteristics. When they feel good, they are pleased and active. Aristotle said: "Happiness is a state of activity". A good emotion is accompanied not just by pleasure but also by positive activity. Therefore, the proposed emotional model describes users' emotional statuses along two dimensions: activity level and pleasure level. Activity levels measure how active or passive a user is. A coordinate system is constructed with pleasure as the horizontal coordinate and activity as the vertical coordinate (shown in Fig. 1). Then, we can position each emotional state in this coordinate system. For example, (P1, A1) represents very inactive and very unpleasant. Through statistical analysis of the collected information, we find that the emotional statuses of most mobile users lie between the third and fourth levels. Therefore, the emotional status value of a mobile user satisfies the following rule most of the time: P3 ≤ Pi ≤ P4 and A3 ≤ Ai ≤ A4. This is because extreme emotions rarely occur. In normal life, a person is more likely to be happy than sad. The statistics on the distribution of mobile users' emotional statuses are shown in Table 1 of paper [10].

B. CLASSIFIER FUSION METHOD
The main idea of our method is to map the sensor data collected by the mobile phone into the emotional model, which gives us the real-time emotional status of the mobile phone user. Many classifiers can be chosen for this purpose, such as Perceptron, Naive Bayes, Decision Tree, K-Nearest Neighbor, and Support Vector Machine. In real pattern recognition applications, different classifiers make "independent" errors. In our system, different sensors collect different data sets, and these data sets give different results with different classifiers. Therefore, we use a multiple classifier method, which achieves better results by integrating classical classifiers. The main problems of traditional multiple classifier methods are how to choose good base classifiers and how to design a fusion strategy. In our method, complementarity is taken into account together with diversity when selecting classifiers. In the classifier selection phase, we combine a clustering algorithm with the silhouette coefficient to determine the number of target base classifiers and use inconsistency to evaluate the differences between them. In the fusion phase, we set the weights of the base classifiers based on their prior and conditional probabilities.
In the following, the details of our classifier fusion method are described in three phases: the generation phase, the selection phase, and the fusion phase.

1) GENERATION PHASE
Considering the diversity of base classifiers, we construct a base classifier pool in this stage. The base classifiers are chosen from the classification algorithms that are based on decision trees and neural networks. We train these base classifiers on different data sets to calculate their diversities.
To improve the accuracy of the selection phase and decrease its computational complexity, we generate in this phase a classifier pool that consists of as many potentially good base classifiers as possible. As mentioned above, the diversity of the classifier pool is very important. It guarantees that we can obtain different but complementary classifiers based on different or disjoint input sub-spaces. To achieve diversity, we choose unstable classifiers such as decision trees and neural networks to construct the classifier pool. These classifiers are trained on the same data set using k-fold cross-validation and bagging. Fig. 2 shows the training process of our base classifiers. For the same data set, even the same algorithm may generate different classification decisions with different parameters. In this case, we use bagging to combine the decision boundaries of these classifiers to reduce the overall error and achieve a better classification effect. Each individual classifier is trained using 10-fold cross-validation. The original data set is randomly divided into 10 parts by sampling without replacement. Nine of these parts are selected as the training set, and the remaining one is reserved as the testing set. Let $T_i$ denote the result of the round in which the $i$th part is reserved as the testing set. The accuracy of the trained classifier is then obtained as the average of these 10 results: $\frac{1}{10}\sum_{i=1}^{10} T_i$. This cross-validation method can reduce over-fitting to a certain extent. At the same time, it obtains as much valid information as possible from the limited data, which reduces the sensitivity of the trained classifier to data partitioning. When the segmentation is performed only once (or only a few times), the distribution of the subset may be approximately the same as that of the original data set. So we use 10-fold cross-validation to guarantee the diversity of the trained classifiers.
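The 10-fold averaging described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names (`k_fold_indices`, `cross_validated_accuracy`, `train_majority`) are hypothetical, each fold is held out once as the testing set, and a trivial majority-label predictor stands in for a real decision tree or neural network.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices once and cut them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(train_fn, samples, labels, k=10, seed=0):
    """Average the per-fold accuracies T_1..T_k of one classifier."""
    fold_accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held_out_set]
        # train_fn returns a predict(sample) function (assumed interface)
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in held_out)
        fold_accuracies.append(correct / len(held_out))
    return sum(fold_accuracies) / k

def train_majority(train_samples, train_labels):
    """Toy stand-in for a base classifier: always predicts the most common label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda sample: majority

samples = list(range(100))
labels = [1 if i % 4 else 0 for i in range(100)]  # 75 of the 100 labels are 1
accuracy = cross_validated_accuracy(train_majority, samples, labels)
```

With this toy predictor every fold predicts label 1, so the averaged accuracy equals the share of label 1 in the data; a real base classifier would be plugged in through `train_fn`.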

2) SELECTION PHASE
To ensure the accuracy of the fusion classifier, we select the most suitable classifiers for the target data set from the classifier pool. The first problem is how many classifiers should be chosen. We use the k-means method to partition the data set into k clusters and then use the silhouette coefficients corresponding to different values of k to find the best k. Let $S_i^j$ denote the silhouette coefficient of data item $d_i$ when we set k = j. The silhouette coefficient of data set D can be defined as formula (1). The best k is then obtained as $\arg\max_k S_D^k$.
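The choice of k can be illustrated with a short sketch. It assumes the standard definition of the silhouette coefficient (Euclidean distance, data-set value taken as the mean over all items), which may differ in detail from formula (1); the function names and the toy one-dimensional data are hypothetical, and the cluster labels are supplied directly instead of being produced by k-means.

```python
import math

def silhouette_of_dataset(points, labels):
    """Mean silhouette coefficient over all points (assumed standard definition)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, m) for m in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(point, m) for m in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, labelings):
    """Return the k whose labeling gives the largest data-set silhouette."""
    return max(labelings, key=lambda k: silhouette_of_dataset(points, labelings[k]))

# Three well-separated pairs of points, clustered with k = 2 and k = 3.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
labelings = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
chosen_k = best_k(points, labelings)
```

In this toy case the three-cluster labeling scores higher than the two-cluster one, so `best_k` picks 3; in practice the labelings would come from running k-means for each candidate k.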
When people choose classifiers, they tend to select those that are most suitable for the target data sets, because suitable classifiers can improve the generalization ability and accuracy of the final fusion classifier [6]. As unstable classifiers are sensitive to changes in the training data set and the learning data set, they tend to be the best base classifiers for combination strategies. They can produce very different classification results even when there are only very subtle changes in the training samples. After we determine the number of base classifiers, we choose that number of classifiers that are the most mutually inconsistent and the most accurate on the same data set and fuse them into the final decision classifier. First, we set an accuracy threshold. Classifiers whose accuracy is higher than this threshold are chosen as the spare classifiers. Then the final base classifiers are chosen from these spare classifiers based on their difference operators.
The difference operator of a classifier $C_i$ can be calculated by formula (2).
Here, N represents the total number of classifiers in the pool. For any sample that belongs to the mth class, if $C_i$ identifies it as belonging to the mth class and $C_j$ identifies it as belonging to another class, $N_{C_i,C_j}$ increases by 1. Likewise, if $C_j$ identifies it as belonging to the mth class and $C_i$ identifies it as belonging to another class, $N_{C_i,C_j}$ increases by 1. Obviously, the difference operator of $C_i$ lies in [0, 1]. A large difference operator indicates that the classifier differs greatly from all other classifiers. The k classifiers with the largest difference operators are selected as the final base classifiers.
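The selection rule can be sketched as follows. Since formula (2) is not reproduced here, the normalisation is an assumption: the difference operator of a classifier is taken as its pairwise disagreement count averaged over the other N - 1 classifiers and divided by the number of samples, which keeps it in [0, 1]. The classifier names, predictions, and threshold below are purely illustrative.

```python
def disagreement(pred_i, pred_j, truth):
    """N_{Ci,Cj}: samples that exactly one of the two classifiers gets right."""
    return sum((a == t) != (b == t) for a, b, t in zip(pred_i, pred_j, truth))

def difference_operators(predictions, truth):
    """Assumed normalisation: mean pairwise disagreement rate, in [0, 1]."""
    names = list(predictions)
    n, m = len(names), len(truth)
    return {
        ci: sum(disagreement(predictions[ci], predictions[cj], truth)
                for cj in names if cj != ci) / ((n - 1) * m)
        for ci in names
    }

def select_base_classifiers(predictions, truth, k, accuracy_threshold):
    """Keep classifiers above the accuracy threshold, then take the k most different."""
    accuracy = {name: sum(p == t for p, t in zip(pred, truth)) / len(truth)
                for name, pred in predictions.items()}
    scores = difference_operators(predictions, truth)
    spare = [name for name in predictions if accuracy[name] > accuracy_threshold]
    return sorted(spare, key=scores.get, reverse=True)[:k]

truth = [0, 1, 0, 1, 0, 1, 0, 1]
predictions = {                       # hypothetical pool of four classifiers
    "tree_a": [0, 1, 0, 1, 0, 1, 0, 0],
    "tree_b": [0, 1, 0, 1, 0, 1, 1, 0],
    "mlp":    [1, 0, 1, 1, 0, 1, 0, 1],
    "weak":   [1, 0, 1, 0, 0, 1, 0, 0],
}
scores = difference_operators(predictions, truth)
selected = select_base_classifiers(predictions, truth, k=2, accuracy_threshold=0.6)
```

Here `weak` is filtered out by the accuracy threshold even though its difference score is high, and of the remaining three the two that disagree most with the rest of the pool are kept.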

3) FUSION PHASE
Average fusion, weight fusion, and majority vote fusion are the three most popular fusion strategies, but they all have known defects. The average fusion and majority vote strategies are sensitive to extreme values, and their performance is uncertain. Researchers therefore try to assign different weights to different base classifiers on the basis of features such as their accuracy. This is the main idea of the weight fusion strategy.
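For reference, the three conventional strategies can be sketched as below. This is an illustrative reading, not code from the evaluated system: each base classifier is assumed to output a list of class probabilities, the weight fusion variant uses each classifier's accuracy as a fixed weight, and all function names are hypothetical.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

def majority_vote(probabilities):
    """Each classifier votes for its most probable class; the most common vote wins."""
    votes = [argmax(p) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def average_fusion(probabilities):
    """Average the class probabilities over all classifiers, then take the best class."""
    n = len(probabilities)
    return argmax([sum(column) / n for column in zip(*probabilities)])

def accuracy_weighted_fusion(probabilities, accuracies):
    """Weight each classifier's probabilities by its (fixed) accuracy."""
    scores = [sum(w * p for w, p in zip(accuracies, column))
              for column in zip(*probabilities)]
    return argmax(scores)

# Three classifiers, two classes: one confident outlier and two mild dissenters.
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
vote = majority_vote(probabilities)
average = average_fusion(probabilities)
weighted = accuracy_weighted_fusion(probabilities, [0.2, 0.9, 0.9])
```

On this input the majority vote picks class 1 while plain averaging is pulled to class 0 by the single confident classifier, which illustrates the sensitivity to extreme values; down-weighting that classifier restores class 1.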
In our fusion strategy, the weights of the base classifiers change dynamically according to the features of the target data sets. This dynamic adaptive weight strategy is implemented based on the classification confidences and the prior probabilities of the base classifiers.
The prior probabilities of the base classifiers can be computed from all of their historical decision results on the same type of data sets. Suppose that l classifiers are chosen as the base classifiers of the fusion strategy. Their prior probabilities can be described by the confusion matrix defined in formula (3).
If the ith classifier mistakenly assigns a sample that belongs to the pth class to the qth class, the value of $z^i_{pq}$ increases by 1. All of the historical training results are counted, and the entries of CM are then normalized.
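The bookkeeping behind the confusion matrix of one classifier can be sketched as follows. The counting follows the description above; the row-wise normalisation (each true class sums to 1) is an assumption, since the normalisation scheme is not spelled out, and the function name is hypothetical.

```python
def normalised_confusion_matrix(truth, predicted, n_classes):
    """Entry [p][q] is the share of class-p samples that were assigned to class q."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for p, q in zip(truth, predicted):
        counts[p][q] += 1          # z_pq: true class p, predicted class q
    matrix = []
    for row in counts:
        total = sum(row)
        # assumed row-wise normalisation; an unseen class keeps an all-zero row
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

truth     = [0, 0, 0, 1, 1, 2]
predicted = [0, 0, 1, 1, 1, 0]
cm = normalised_confusion_matrix(truth, predicted, n_classes=3)
```

Off-diagonal entries such as `cm[0][1]` then estimate how often this classifier confuses class 0 with class 1 on historical data.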
The classification confidence of the fusion classifier can be computed from the final weights and the class-conditional probabilities of the base classifiers. For a classifier $C_i$, let $p_i^k$ denote its class-conditional probability for the kth class. Then the confidence of the final fusion classifier on the kth class is defined as formula (4).
Here, $\lambda_{ki}$ represents the kth fusion weight of the ith classifier. It can be obtained from formula (5).
In formula (5), $\pi_k^i = \sum_{p=1,\,p\neq k}^{l} z_{kp}^{i}\,\rho_{pl}^{i} + \sum_{q=1,\,q\neq k}^{l} e_{qk}^{i}\,\rho_{qk}^{i}$. The definition of $\rho$ is shown in formula (6). It denotes the inverse reliability of the ith classifier at the pth class with respect to the qth class.
The final classification decision is the class on which the fusion classifier has the highest confidence.
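The final decision step can be sketched as below. Because formulas (4) and (5) are not reproduced in this text, the sketch makes two explicit assumptions: the confidence on class k is a weighted sum of the base classifiers' class-conditional probabilities, and the fusion weights are passed in as given numbers instead of being derived from the confusion matrices. The function name and all values are illustrative.

```python
def fuse(class_probabilities, weights):
    """Return (decision, confidences) for one sample.

    class_probabilities[i][k]: class-conditional probability of classifier i for class k.
    weights[i][k]: fusion weight of classifier i for class k (assumed to be given).
    """
    n_classes = len(class_probabilities[0])
    confidences = [
        sum(w[k] * p[k] for p, w in zip(class_probabilities, weights))
        for k in range(n_classes)
    ]
    # the decision is the class with the highest fused confidence
    decision = max(range(n_classes), key=confidences.__getitem__)
    return decision, confidences

class_probabilities = [[0.6, 0.3, 0.1],   # classifier 1
                       [0.2, 0.5, 0.3],   # classifier 2
                       [0.1, 0.7, 0.2]]   # classifier 3
weights = [[0.5, 0.2, 0.3],
           [0.3, 0.4, 0.3],
           [0.2, 0.4, 0.4]]
decision, confidences = fuse(class_probabilities, weights)
```

Because the weights are indexed per class as well as per classifier, a classifier that is reliable for one emotion level but not for another can contribute differently to each confidence.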

IV. MSPMERAD
Limited by computing power and battery capacity, mobile phones cannot consistently provide sufficiently accurate results in a reasonable amount of time for computationally intensive applications when they rely on local computation alone. So we use the remote cloud for assistance. The processing of the emotion recognition algorithm is offloaded to the cloud, and mobile phones are only responsible for collecting sensor data and showing the recognition results. For the sake of the scalability of the emotion recognition cloud service, each component and classifier is encapsulated into one micro-service, which is implemented as one Docker container. Fig. 4 shows the architecture of our micro-service based emotion recognition system. All mobile phones on which the emotion recognition client application EAmobile is installed collect users' behavior data through sensors. These data are transferred to the remote cloud and pre-processed by the pre-processing micro-service. Features that can indicate the emotional status are extracted from the original data and stored in the HBase database on the cloud. The real-time emotional status of a mobile user is recognized by the emotion recognition micro-service based on the real-time sensor data of the user's mobile phone, the historical sensor data of all mobile users, and their corresponding emotional statuses. The recognition result is also stored in the HBase database on the cloud. Pre-processing, feature extraction, classifier implementation, selection of base classifiers, and fusion are all encapsulated into micro-services based on the Kubernetes framework. Therefore, service providers can improve the accuracy of their emotion recognition services by improving individual services only, such as replacing classifiers or improving fusion methods, rather than rewriting all services.
In what follows, we first describe the method that the pre-processing micro-service uses to eliminate useless data. Then, we explain the process of the feature extraction micro-service in detail.
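The loose coupling described above can be illustrated with an in-process sketch. This is only an analogy under stated assumptions: real micro-services would run as separate containers and communicate over the network, whereas here each stage is a plain Python function behind a common interface; the stage names, message fields, thresholds, and emotion labels are all hypothetical.

```python
from typing import Callable, Dict, List

Stage = Callable[[dict], dict]  # every stage maps one message to the next message

class Pipeline:
    """Runs named stages in a fixed order; any stage can be swapped independently."""

    def __init__(self, order: List[str]):
        self.order = order
        self.stages: Dict[str, Stage] = {}

    def register(self, name: str, stage: Stage) -> None:
        if name not in self.order:
            raise KeyError(f"unknown stage: {name}")
        self.stages[name] = stage  # replaces any earlier implementation

    def run(self, message: dict) -> dict:
        for name in self.order:
            message = self.stages[name](message)
        return message

def preprocess(msg):
    # drop missing readings
    return {**msg, "readings": [r for r in msg["readings"] if r is not None]}

def extract_features(msg):
    readings = msg["readings"]
    return {**msg, "features": {"mean": sum(readings) / len(readings)}}

def recognise(msg):
    # toy rule standing in for the fusion classifier
    return {**msg, "emotion": "active" if msg["features"]["mean"] > 0.5 else "inactive"}

pipeline = Pipeline(["preprocess", "extract_features", "recognise"])
pipeline.register("preprocess", preprocess)
pipeline.register("extract_features", extract_features)
pipeline.register("recognise", recognise)
first = pipeline.run({"readings": [0.2, None, 0.9, 1.0]})["emotion"]

# Swap only the recognition stage; the other stages are untouched.
pipeline.register("recognise", lambda msg: {
    **msg, "emotion": "active" if msg["features"]["mean"] > 0.8 else "inactive"})
second = pipeline.run({"readings": [0.2, None, 0.9, 1.0]})["emotion"]
```

Replacing `recognise` changes the result without touching pre-processing or feature extraction, which mirrors how a provider could replace a classifier service without rewriting the others.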

A. DATA PRE-PROCESSING MICRO-SERVICE
In the process of collecting data, the mobile phone may encounter various problems that prevent the data from being used directly. For instance, the ongoing data acquisition process may be interrupted by an incoming call, which makes the data incomplete. The surrounding magnetic field may cause the compass data of the phone to be inaccurate. External noise may interfere with the sound sensor and make the collected data noisy. The data pre-processing micro-service can eliminate the noise, fill in the incomplete data, and delete the inaccurate data. Fig. 5 shows the process flow of the pre-processing micro-service. Initial data that are incomplete, noisy, or inaccurate are replaced or filled with reference data, which can be a default value, an average value, or the neighboring data. If no such reference data are available, the affected data are deleted.
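The replace-or-delete rule can be sketched as follows. This is one possible reading and not the service's actual logic: an unusable reading is replaced by the average of its valid neighbors, otherwise by a default value if one is supplied, and otherwise dropped. The function name, the validity check, and the sample values are hypothetical.

```python
def clean_readings(readings, is_valid, default=None):
    """Repair or drop unusable sensor readings.

    readings: list of numbers, with None marking a missing value.
    is_valid: predicate that rejects noisy or inaccurate values.
    default:  optional fallback used when no valid neighbor exists.
    """
    cleaned = []
    for i, value in enumerate(readings):
        if value is not None and is_valid(value):
            cleaned.append(value)
            continue
        # reference data: the valid readings directly before and after this one
        neighbors = [readings[j] for j in (i - 1, i + 1)
                     if 0 <= j < len(readings)
                     and readings[j] is not None and is_valid(readings[j])]
        if neighbors:
            cleaned.append(sum(neighbors) / len(neighbors))
        elif default is not None:
            cleaned.append(default)
        # otherwise there is no reference data, so the reading is deleted
    return cleaned

raw = [20.0, None, 22.0, 500.0, None, None]
in_range = lambda v: 0.0 <= v <= 100.0
repaired = clean_readings(raw, in_range)
with_default = clean_readings(raw, in_range, default=0.0)
```

In the example, the gap between 20.0 and 22.0 is filled with their average, the out-of-range 500.0 is replaced by its one valid neighbor, and the trailing gaps are deleted unless a default is given.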

B. FEATURE EXTRACTING MICRO-SERVICE
Different sensors collect different data, and some of these data indicate different emotional statuses of users. For example, when a user's voice is louder than usual and his phone is shaken very frequently, this usually indicates that he is active. At the same time, if he presses the phone's display screen harder than usual, he might be angry with someone. Different sensors have different data features that can be used to infer their user's emotional status. Therefore, we extract these features from the pre-processed data to enhance the accuracy of the emotion recognition algorithms. The details of the features extracted by the feature extraction micro-service are the same as those in paper [10] (Table 2).

V. EVALUATION
In this section, we describe the evaluation results of the proposed system. We construct an Android application named EAmobile, which collects the daily sensor data of the user of its host phone. The monitored sensors are the same as those listed in Table 3 of paper [10]. 50 volunteers are chosen from the undergraduate students of Xidian University. They install EAmobile on their phones and keep it running for 7 months. They label the daily data as needed, which helps us identify the mapping between the data and their emotional statuses at the time. Our remote cloud server collects these data every day. The pre-processing micro-service and the feature extraction micro-service handle these data and store them in the HBase database of the system. We use these data to train our fusion classifier and obtain the best weights. We choose 8 popular classifiers and put them into our classifier pool. Table 3 in paper [10] shows the original accuracy without fusion when these 8 classifiers are trained on our sample. To evaluate the effectiveness of our method, we compare it with some common classifiers: Linear Discriminant Analysis (LDA), Classification And Regression Tree (CART), REP, MLP, etc.

A. NUMBER OF BASE CLASSIFIERS
We vary k from 2 to 9 and compute the silhouette coefficient of the data set after applying k-means clustering. Table 1 shows the results. The silhouette coefficient reaches its maximum value when k = 3.
We choose 2, 3, and 4 classifiers, respectively, as the base classifiers of our adaptive weight fusion classifier. The recognition results are shown in Table 2. Groups no. 4 and 7 have the best performance. These results are consistent with the optimal number of base classifiers found above: when we choose three base classifiers, higher performance is reached. When we choose two base classifiers, the accuracy is not always higher than when only one base classifier is selected. For groups no. 1, 2, and 3, the accuracy is improved. This is because they combine two classifiers of different types, which brings their complementarity into play. For groups no. 5 and 6, however, the accuracy drops because the two base classifiers come from the same class. The same rule can be observed in the groups with four base classifiers. Therefore, in the following experiments, we choose three classifiers from different classes as the base classifiers of the fusion classification method.

B. EFFECT OF OUR FUSION STRATEGY
We run experiments using our strategy and two other fusion strategies, the accuracy based method and the majority vote method, and compare their results. Table 3 shows the accuracy of these three strategies on our samples. All results are averages over experiments that are repeated 5000 times. The results show that the proposed method always reaches the highest performance. We construct two classifier pools. The first one consists of REP, CART, and LAD, which have the largest average difference and accuracy. Thus, the performance of the accuracy strategy is better than that of the vote strategy. The system's accuracy is significantly improved when we switch to the proposed strategy. In the second group, REP, which has the lowest accuracy, is replaced by DTNB. The system's emotion recognition accuracy then improves under all strategies.

C. EFFICIENCY OF THE MICRO-SERVICE PLATFORM
We compare the efficiency of emotion recognition in two computing modes: local computing and micro-service based computing. The computing time is the average over 500 individual experiments. The average computing time of the local computing mode is 3 s, whereas that of the micro-service based computing mode is 50 ms. Table 4 shows the parameters of the mobile phones and cloud servers used in the experiments.
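As a worked check of the two reported averages, the ratio of the local computing time to the micro-service based computing time gives the speed-up factor. The symbols $t_{\text{local}}$ and $t_{\text{ms}}$ are introduced here only for illustration.

```latex
\[
\text{speed-up} \;=\; \frac{t_{\text{local}}}{t_{\text{ms}}}
\;=\; \frac{3\ \text{s}}{50\ \text{ms}}
\;=\; \frac{3000\ \text{ms}}{50\ \text{ms}}
\;=\; 60
\]
```

That is, on these averages the micro-service based mode is about 60 times faster than the local mode.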

VI. CONCLUSION
A micro-service based emotion recognition system for mobile phone users is presented in this paper. We train the dynamic adaptive fusion weights of the base classifiers on the behavior data of 50 student volunteers, collected by their mobile phone sensors. The highest average accuracy reaches 72.37%. The micro-service platform can ease the construction of emotion recognition mobile applications. Application providers and algorithm researchers can improve their work without affecting each other. In future work, we will open this architecture to app developers for real applications.