Human–Robot Collaboration Using Sequential-Recurrent-Convolution-Network-Based Dynamic Face Emotion and Wireless Speech Command Recognitions

The proposed sequential recurrent convolution network (SRCN) includes two parts: one convolution neural network (CNN) and a sequence of long short-term memory (LSTM) models. The CNN extracts the feature vector of a face emotion or speech command. Then, a sequence of LSTM models with shared weights processes the sequence of feature vectors that the (pre-trained) CNN produces from a sequence of input sub-images or spectrograms, corresponding to face emotion and speech command, respectively. Simply put, one SRCN for dynamic face emotion recognition (SRCN-DFER) and another SRCN for wireless speech command recognition (SRCN-WSCR) are developed. The proposed approach not only effectively tackles the recognition of the dynamic mapping of face emotion and speech command, with average generalized recognition rates of 98% and 96.7%, respectively, but also prevents the overfitting problem in a noisy environment. The comparisons among mono and stereo visions, Deep CNN, and ResNet50 confirm the superiority of the proposed SRCN-DFER. The comparisons among SRCN-WSCR with noise-free data, SRCN-WSCR with noisy data, and a multiclass support vector machine validate its robustness. Finally, the human-robot collaboration (HRC) using our developed omnidirectional service robot, including human and face detection, trajectory tracking by the previously designed adaptive stratified finite-time saturated control, face emotion and speech command recognition, and music playing, validates the effectiveness, feasibility, and robustness of the proposed method.


I. INTRODUCTION
Recently, different kinds of robots have been developed to fulfill human-robot collaborations. Some representative and outstanding works are reviewed as follows. A robot that can understand and express emotions in voice, gesture, and gait by a controller trained only on voice is developed such that the robot can recognize happiness, sadness, and fear in a completely different modality [1]. A humanoid robot's visual imitation of the 3-D motion of a human is developed by a neural-network-based inverse kinematics [2] or a support vector machine for the classification of 11 low-body postures [3]. In [4], the suggested robot Mortimer, which includes social behaviors, can increase engagement and social presence; in addition, the effect of extending weekly collocated musical improvisation sessions is investigated by making Mortimer an active member of the participant's virtual social network. In [5], specific human following through machine learning of SSD-FN-KCF is developed. In [6], musical robots are designed to control dynamics, articulation, and tempo to give the audience an experience comparable with listening to a professional human musician. Wolfe et al. [7] develop a singing robot platform that can interact with surrounding humans by communicating through song and musical, nonlinguistic utterances to evoke emotional responses in humans. An auxiliary online diagnoser using the Bayesian decision theory provides not only a collision identification for human-collaborative robots but also a confidence index to represent their reliability [8]. Additionally, many researchers have been committed to robots with the ability to detect and identify human emotions and then apply this information to guide their own behaviors, which are called affective intelligent robots [9], [10], [11], [12], [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Yangmin Li.
Since face emotion plays the most important role, making affective intelligent robots with accurate and real-time face emotion recognition becomes a challenging task. Many related articles examine the classification of facial expression images into several typical classes: angry, disgusted, fearful, happy, surprised, and sad [14].
It is known that the detection of human emotions from facial emotions is crucial for social interaction. The proposed sequential recurrent convolution network (SRCN) addresses the corresponding drawbacks, simultaneously enhances the performance, and is then applied to the human-robot collaboration (HRC) task. At the outset, the stereo camera on the omnidirectional service robot (ODSR) searches for and detects the human by Faster R-CNN [15]. If a face candidate exists, the face detected by the Haar Cascade feature descriptor is cropped to a suitable size to recognize his/her face emotions. If not, the ODSR approaches a suitable pose region by the stereo-vision-based localization [16], [17] and a stratified finite-time saturated control [18], [19]. Subsequently, the facial emotion is recognized by the SRCN-DFER. The stereo camera not only estimates the 3D position up to 20m but also improves the recognition rate, since the use of left and right cameras increases the FOV [16]. A dynamic recognition rate for video, which indicates the stabilized facial emotion within a specific time interval, is also defined to meet the requirement of HRC.
In addition, another SRCN for wireless speech command recognition (SRCN-WSCR) is developed to deal with more complex HRC tasks. Eight speech commands from the Google Speech Commands Dataset are employed to train and verify the SRCN-WSCR. At the outset, the sampled speech command is transferred into a frequency-domain signal by the Short-Time Fourier Transform (STFT) with a suitable window length and hop length. Multiplying the power spectrum of the STFT signal by the Mel filter matrix and taking the logarithm yields the log Mel-spectrogram [19], which is set as the input signal of SRCN-WSCR. Since video sequences for facial emotion are limited, a large number of static facial emotion images is applied to train the CNN, which is the first part of the SRCN. After that, the weights of the CNN without the fully connected and softmax layers are assigned as part of the initial weights of the SRCN-DFER. The other initial weights, for the stack of LSTMs, are set as small random numbers. In contrast, because the eight designed speech commands have many dynamic files, a pre-trained CNN is not required for the SRCN-WSCR. In summary, the proposed approaches not only effectively tackle the recognition of the dynamic mapping of face emotion and speech command but also prevent the overfitting problem in the presence of noise. Finally, the HRC by the omnidirectional service robot, including human and face detection, trajectory tracking using the adaptive stratified finite-time saturated control, face emotion and speech command recognition, and music playing, validates the effectiveness and robustness of our method.

II. RELATED WORK
As a pattern recognition task, plenty of classification methods can be adopted to classify facial emotions [11], [12], [13], [20], [21], [22], [23], [24], [25], [26], [27], [28]. In [11] and [12], three- and two-layer fuzzy support vector regression-Takagi-Sugeno models are suggested for emotion understanding in human-robot-interaction (HRI) tasks, e.g., the drink reflecting different emotions and human following [5]. Their average video-based recognition rates for different genders, provinces, and ages are only moderate. The aim of [13] is to make good use of the CNN's potential performance in avoiding local optima and speeding up the convergence by a hybrid genetic algorithm with an optimal initial population, in such a way that it realizes deep and global emotion understanding in HRI. Nonetheless, its average video-based recognition rate is also only moderate. In [20], multi-modal recurrent attention networks learn spatiotemporal attention volumes to robustly recognize the facial expression. Besides the sequential RGB images, depth and thermal sequences are also required. In [21], a deep learning framework based on the hybrid of a 3D conditional generative adversarial network and a two-level attention bidirectional long short-term memory network is proposed for robust driver drowsiness recognition. However, the average recognition rate in different situations is only acceptable. In [22], a 3D-CNN is first designed to capture subtle spatiotemporal changes that may occur on the face, and a Conv-LSTM network is then designed to learn semantic information by taking into account longer spatiotemporal dependencies. Although the recognition result is acceptable, the proposed scheme is complex. A two-branch disentangled generative adversarial network disentangles expressional information from other unrelated facial attributes [23]. Although the average image-based recognition rate for the CK+, TFEID, and RaFd datasets is excellent, its generalized recognition is poor.
In [24], a correlation-based graph convolutional network for automatic emotion recognition (ARE) is developed, which comprehensively considers the correlation of intra-class and inter-class videos for feature learning and information fusion. However, the average recognition rate of ARE is only moderate, and there are large variations across datasets. In [25], modeling pose variations in facial images to boost the performance of face emotion recognition is achieved by an end-to-end weakly supervised approach. However, its average recognition rate and generalization are not excellent. In [26], the Learnable Graph Inception Network, which jointly learns to recognize emotion and identify the underlying graph structure in the dynamic data, is developed. It possesses a satisfactory average recognition rate for the RML, eNTERFACE, and RAVDESS datasets. In [27], event cameras, which capture motion at millisecond rates and work under challenging conditions such as low illumination, are used to understand human reactions by observing only facial expressions. Even a combination of CNN and Bi-LSTM for face emotion recognition [28] fails to accomplish a satisfactory performance due to a lack of effective datasets for training. Besides the above research, a multimodal fusion framework for noncontact heart rate (HR) estimation, comprising feature representation maps from facial visible-light and thermal infrared videos and a temporal-information-aware HR feature extraction network for encoding discriminative spatiotemporal information, is developed [31]. It indicates that face recognition can be adopted for different applications [9], [10], [11], [12], [13].

B. PROBLEM DESCRIPTION
At first, the ODSR searches for and detects the human through Faster R-CNN [15]. If the face is within an orientation of −45° to 45° with respect to the optical axis and the distance is less than 3.5m, the Haar Cascade feature descriptor is employed to crop a suitable face for recognizing his/her face emotion (e.g., angry, disgusted, fearful, happy, surprised, and sad). If not, the strategy to approach the above pose region is achieved by the stereo-vision-based localization and an adaptive stratified finite-time saturated control (ASFTSC) [19]. Subsequently, the face emotion is recognized by the proposed SRCN-DFER. Based on the FOV, a face candidate in the regions R1, R2, R3, or R4 depicted in Fig. 2 is searched by the ODSR. The initial optical axis of the FOV in region R1 is 0°. If a human is not detected through the Faster R-CNN, the optical axis of the FOV rotates by θ_R2 = −90° to detect the human candidate in region R2. If a human is still not detected, the optical axis rotates by θ_R3 = −90° to detect the human candidate in region R3. Likewise, if a human is not detected, the optical axis rotates by θ_R4 = −90° to detect the human candidate in region R4. If the ODSR cannot find the human with the above searching strategy, it moves forward a specific distance (e.g., 5m) and executes the same procedure. If a human is detected, the center point of the bounding box of the detected human is estimated by the stereo vision system to obtain the 2D pose between the detected human and the ODSR. The overall flowchart of HRC using the proposed SRCN-DFER and SRCN-WSCR is also depicted in Fig. 3.
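The searching strategy above can be sketched as a small loop. This is only an illustrative sketch, not the authors' implementation: the callback `detect` is a hypothetical stand-in for the Faster R-CNN human detector, the heading is assumed to be measured relative to the optical axis at the start of each scan, and `max_rounds` is an assumed stopping limit that the text does not specify.

```python
from typing import Callable

ROTATION_STEP_DEG = -90.0   # rotation between regions R1..R4, as stated in the text
ADVANCE_DISTANCE_M = 5.0    # example forward distance given in the text
MAX_FACE_ANGLE_DEG = 45.0   # face must lie within +/-45 degrees of the optical axis
MAX_FACE_RANGE_M = 3.5      # and be closer than 3.5 m


def face_pose_ok(angle_deg: float, distance_m: float) -> bool:
    """Return True when the face lies in the region where cropping is attempted."""
    return abs(angle_deg) <= MAX_FACE_ANGLE_DEG and distance_m < MAX_FACE_RANGE_M


def search_human(detect: Callable[[float, float], bool], max_rounds: int = 3):
    """Scan regions R1..R4, then advance and repeat the same scan.

    `detect(heading_deg, travelled_m)` is a hypothetical detector callback.
    Returns (region, heading_deg, travelled_m) for the first detection,
    or None if nothing is found within `max_rounds` scans (assumed limit).
    """
    travelled = 0.0
    for _ in range(max_rounds):
        heading = 0.0  # optical axis of region R1
        for region in ("R1", "R2", "R3", "R4"):
            if detect(heading, travelled):
                return region, heading, travelled
            heading += ROTATION_STEP_DEG  # rotate to the next region
        travelled += ADVANCE_DISTANCE_M   # no human found: move forward
    return None


# Hypothetical scenario: the human becomes visible in R3 after one advance.
result = search_human(lambda heading, dist: heading == -180.0 and dist == 5.0)
```

In this toy scenario the first scan fails in all four regions, the robot advances 5m, and the second scan succeeds in region R3.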

IV. SRCN-DFER AND SRCN-WSCR
A. SRCN-DFER
The proposed architecture of SRCN-DFER is depicted in Fig. 4, which has the upper part for the architecture (e.g., Convolution, Max-pooling) and the lower part for the output. The design concepts of SRCN-DFER are described as follows: (i) Five conv-pooling pairs of appropriate size make up the main part of the CNN [40] such that the classification of face emotion is improved. (ii) Max-pooling is often applied to reduce unnecessary calculation. Nevertheless, the size of max-pooling should not be too large, to avoid information loss; the size of 2 × 2 is suitable. Multiscale convolution kernels (i.e., 7 × 7, 5 × 5, 3 × 3) are suitable for face emotion recognition [16]. (iii) The zero padding of feature maps better utilizes their border information, which is beneficial for the final performance. (iv) From Table 1, the weights of the fully connected layer and the LSTM layer account for most of the total number of weights. Nevertheless, the total number is still smaller than that of DCNN [16] (cf. Table 5) or the ResNet50 in [25].
(v) More than one LSTM is used to tackle the dynamic mapping problem, since each LSTM contains a feedback loop [20], [22], [28], [35], [41]. Moreover, these LSTMs share common weights. (vi) With the online preprocessing mechanism, i.e., Faster R-CNN combined with the Haar Cascade feature descriptor, the proposed method is more practical in comparison to some studies [23], [25], which require suitable faces to be cropped in advance.
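To illustrate item (v), the following minimal sketch unrolls one LSTM cell over a sequence of feature vectors while reusing a single weight set at every time step. It assumes the standard LSTM gate equations; the class name `SharedLSTM`, the feature dimension, the hidden size, and the input values are hypothetical and are not taken from Table 1.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class SharedLSTM:
    """One LSTM cell whose single weight set is reused at every time step."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # Rows are grouped by gate: input, forget, cell candidate, output.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                        for _ in range(4 * hidden_size)]
        self.bias = [0.0] * (4 * hidden_size)

    def step(self, x, h, c):
        """Advance the hidden state h and cell state c by one input vector x."""
        z = list(x) + list(h)
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(self.weights, self.bias)]
        n = self.hidden_size
        i = [sigmoid(p) for p in pre[0:n]]
        f = [sigmoid(p) for p in pre[n:2 * n]]
        g = [math.tanh(p) for p in pre[2 * n:3 * n]]
        o = [sigmoid(p) for p in pre[3 * n:4 * n]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def run(self, sequence):
        """Process a whole sequence and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:  # the same weights are applied at every step
            h, c = self.step(x, h, c)
        return h


# Ten hypothetical 4-dimensional feature vectors, one per frame.
frames = [[0.1 * t, 0.2, -0.1 * t, 0.05] for t in range(10)]
lstm = SharedLSTM(input_size=4, hidden_size=3)
final_state = lstm.run(frames)
```

Because the weights are shared, the number of LSTM parameters in this sketch does not grow with the sequence length; only the amount of computation does.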
Four datasets are used: NTUST-IRL, KDEF, and JAFFE for static images and CK+ for image sequences. The training procedure of SRCN-DFER is described as follows: (i) The CNN with fully connected (FC) and softmax (SM) layers is first trained by static images of 3 datasets: NTUST-IRL, KDEF, and JAFFE. (ii) After that, the pre-trained weights of the CNN, without the FC and SM layers, form a part of the initial weights; together with small random initial weights for the remaining layers, they are employed to train the SRCN-DFER. (iii) Subsequently, the overall weights of SRCN-DFER are trained by 280 batches of sequence images in the CK+ dataset.
The categorical cross-entropy loss function is used for the learning of SRCN-DFER [42]:

$$L = -\sum_{m=1}^{M} t_m \log p_m \tag{1}$$

where $t_m$ is the target signal of the $m$-th facial emotion, $M$ is the total number of facial emotions, and $P = (p_1, p_2, \cdots, p_M)$ is the probability vector of the classified output. Based on (1), the "Adam" stochastic gradient descent (SGD) optimizer (2) is applied to learn the corresponding weights in the CNN (without the stack of LSTMs) or in the SRCN; in its standard form,

$$\hat{w}_i(k+1) = \hat{w}_i(k) - \eta_i(k)\,\frac{\hat{m}_i(k)}{\sqrt{\hat{v}_i(k)} + \varepsilon} \tag{2}$$

where $\eta_i(k)$ is the learning rate with initial value 0.01 and decay rate $10^{-5}$, $g(k)$ is the gradient vector, and $\varepsilon = 10^{-5}$ avoids a division by zero. Moreover, we have

$$\hat{m}_i(k) = \frac{m_i(k)}{1-\alpha_1^{k}}, \quad \hat{v}_i(k) = \frac{v_i(k)}{1-\alpha_2^{k}}, \quad m_i(k) = \alpha_1 m_i(k-1) + (1-\alpha_1)\,g_i(k), \quad v_i(k) = \alpha_2 v_i(k-1) + (1-\alpha_2)\,g_i^2(k) \tag{3}$$

where $g_i(k) = \partial L / \partial \hat{w}_i$, $\alpha_1 = 0.9$, and $\alpha_2 = 0.99$.

Based on (2) and (3), the pre-training curve of the CNN for the 3 datasets, with 4354 static images for training and 1088 for testing, is given in Fig. 5; the final training loss is 0.0057 and the testing loss is 0.0083 after 1063 steps. The result is satisfactory because the training and testing losses are consistent. Further validations are given in Section V. Since only 280 batches of sequence images are available for the six face emotions, the overall training curve for SRCN-DFER with 10 LSTMs is shown in Fig. 6, which reaches a training loss of 3.2 × 10^{-5} after 1043 steps. Since the number of face emotion sequences is much smaller than the number of learned weights, the testing loss is not reported. Finally, the proposed SRCN-DFER is described in Algorithm 1, which follows a concept similar to few-shot object detection [43].

VOLUME 11, 2023 37273 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

The advantages of the stereo camera are elaborated in the following. When the human is in front of the stereo camera, the total FOV of the two cameras is much larger than the 90° FOV of a mono camera (cf. Fig. 7). Because facial emotion recognition is sensitive to the view angle of the camera, the better of the two recognition results is taken as the representative one.
It is definitely better than that of a mono camera at the expense of more processing time. Nevertheless, the increase in processing time is acceptable. The experimental video comparing single and stereo cameras is available at https://youtu.be/qgR7vyokSPo. It indicates that if the orientation of the human face is larger than 15° with respect to the optical axis, the recognition by the mono camera always fails, whereas the stereo camera still succeeds. Although face emotion recognition using a stereo vision system is complex and may seem unnecessary, the simultaneous comparison between the left and right cameras, which share the same learned weights and increase the viewing angle, can improve the recognition rate. This advantage was rarely addressed in previous research.
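Returning to the training ingredients referred to as (1) to (3), which Algorithm 1 relies on, the following sketch evaluates a categorical cross-entropy and performs one Adam update with the constants stated in the text. It assumes the standard Adam formulation with bias correction; the inverse-time form of `learning_rate` is an assumption, since the text gives only the initial rate and the decay rate, and all numeric inputs are hypothetical.

```python
import math

ALPHA1, ALPHA2 = 0.9, 0.99   # moment decay rates stated in the text
EPSILON = 1e-5               # avoids a division by zero
ETA0, DECAY = 0.01, 1e-5     # initial learning rate and its decay rate


def cross_entropy(targets, probs):
    """Categorical cross-entropy L = -sum_m t_m * log(p_m)."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)


def learning_rate(k: int) -> float:
    """Assumed inverse-time decay; the source only gives the two constants."""
    return ETA0 / (1.0 + DECAY * k)


def adam_step(w, g, m, v, k):
    """One standard Adam update of the weight list w at step k (k starts at 1)."""
    eta = learning_rate(k)
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = ALPHA1 * mi + (1.0 - ALPHA1) * gi        # first moment
        vi = ALPHA2 * vi + (1.0 - ALPHA2) * gi * gi   # second moment
        m_hat = mi / (1.0 - ALPHA1 ** k)              # bias correction
        v_hat = vi / (1.0 - ALPHA2 ** k)
        new_w.append(wi - eta * m_hat / (math.sqrt(v_hat) + EPSILON))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


# Hypothetical six-class output where the second emotion is the target.
loss = cross_entropy([0, 1, 0, 0, 0, 0], [0.05, 0.8, 0.05, 0.04, 0.03, 0.03])
# Hypothetical single weight with gradient 2.0 at the first step.
w1, m1, v1 = adam_step([1.0], [2.0], [0.0], [0.0], k=1)
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly one learning-rate unit regardless of the gradient scale.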

Algorithm 1 SRCN-DFER algorithm
Input: Subimage 100 × 100 from preprocessing.
Output: Classification of 6 dynamic face emotions DFE_i, i = 1, 2, ..., 6.
1: Train and test the weights of the CNN in Table 1 using a set of static images, the "Adam" SGD optimizer (2) and (3), and small random initial weights.
2: Train the SRCN-DFER in Table 1, with a suitable number of LSTMs, using the pre-trained weights of the CNN, small random initial weights for the LSTM model, FC, and SM layers, the "Adam" SGD optimizer (2) and (3), and a sequence of dynamic face emotion images.
3: If the classified result is not satisfactory, go back to step 2.
4: Output one of DFE_i, i = 1, 2, ..., 6.

For example, "Left" and "Right" respectively command the ODSR to the left- and right-hand side of the human, with 2.5m between them. Likewise, "Forward" and "Backward" respectively command the ODSR to the front and rear side of the human, with 2.5m between them. The "Stop" command immediately stops the ODSR. The preprocessing of the speech command is described in Fig. 8 or Algorithm 2. The architecture of the proposed SRCN-WSCR is described in Table 2, which is simpler than the SRCN-DFER in Table 1.

Before the online application of speech command recognition, the training data from the Google Speech Commands Dataset v0.02 are employed to train and test the SRCN-WSCR algorithm. The dataset includes (i) over 100,000 speech files of 1s length for 35 classes and (ii) speech files with 6 different background noises. Based on the "Adam" SGD optimizer, the training and testing losses of the SRCN-WSCR with 101 LSTMs are respectively 0.013 and 0.071 after 752 iterative steps, using 10372 training and 2324 testing voice files. Since the dynamic feature of the speech command is dominant, the number of LSTMs is increased in comparison to that in the SRCN-DFER.
Since these voice files are sufficiently numerous and dynamic, no pre-training response is given. Finally, Algorithm 2 is applied online to the wireless speech command recognition [19]. Its Step 3 uses a commonly used formulation to convert linear frequency to Mel-scale frequency such that human speech is more easily distinguished.
[...] Table 2 with a suitable number of LSTMs and the "Adam" SGD optimizer (2) and (3).
6: Output one of WSC_j, j = 1, 2, ..., 8.
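As an illustration of the frequency conversion mentioned for Step 3 of Algorithm 2, the sketch below uses one widely used Hz-to-Mel mapping, m = 2595 log10(1 + f/700). Whether the authors use exactly this variant is an assumption, as are the helper names, the number of filters, the frequency range, and the use of the natural logarithm in `log_mel`.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Common Hz-to-Mel mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float):
    """Centre frequencies (Hz) of filters spaced uniformly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


def log_mel(power_spectrum, filter_bank, floor: float = 1e-10):
    """Apply a Mel filter matrix to one power-spectrum frame, then take the log."""
    return [math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), floor))
            for row in filter_bank]


# Hypothetical setting: 8 filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(8, 0.0, 8000.0)
# Hypothetical two-bin frame with a single averaging filter.
frame = log_mel([1.0, 1.0], [[0.5, 0.5]])
```

The centre frequencies are dense at low frequencies and sparse at high ones, which is the property that makes the Mel scale suited to speech.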

V. EXPERIMENTAL RESULTS AND DISCUSSIONS
A. VIDEO-BASED FACE EMOTION RECOGNITION
Most of the video-based recognition rates of previous research (e.g., [11], [12], [13], [20], [28]) are only acceptable. Before the implementation of HRC, the video-based recognition rates of mono and stereo cameras at three view angles, −15°, 0°, and 15°, at 3m are compared in Table 3.
The important observations of Table 3 are addressed as follows: (i) The average recognition rate of the stereo camera is 98.4% (URL: https://youtu.be/qgR7vyokSPo), which is 4.9% better than that of the mono camera. Moreover, it is better than that of previous research, e.g., [11], [12], [13], [20], [21], [22], [23], [24], [25], [26], [27], and [28]. (ii) When the view angle is zero, i.e., the face is directly ahead of the camera, the recognition rate of the stereo camera is only 0.6% better than that of the mono camera. Nevertheless, the recognition rate at the view angle of −15° for the stereo camera is 8.6% better than that of the mono camera. (iii) The face emotions in Table 3 have not only right-left view angle changes but also up-down pose variations. (iv) In summary, the two cameras of the stereo camera recognize face emotions separately, and the result with the higher confidence is the final output. Using the advantages of the stereo camera for the recognition of face emotions at different view angles, including left-right and up-down pose changes, yields a better result for distances between 0.8 and 3.5m. (v) Although the previous study [28] has a satisfactory average recognition rate of 84.32%, its "fearful" emotion is the lowest (59.09%). On the contrary, the "fearful" emotion in Table 3 reaches at least 82%.
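Observation (iv) amounts to a simple selection rule, which can be sketched as follows. The function name `fuse_stereo`, the `(label, confidence)` representation, and the tie-breaking toward the left camera are assumptions made for illustration; the confidence values are hypothetical.

```python
def fuse_stereo(left, right):
    """Pick the more confident of the two per-camera predictions.

    Each argument is a (label, confidence) pair, or None when that camera
    found no usable face. Ties go to the left camera (an arbitrary choice).
    """
    candidates = [p for p in (left, right) if p is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1])


# Hypothetical case: the right camera sees the face more frontally.
decision = fuse_stereo(("happy", 0.62), ("happy", 0.91))
```

When one camera loses the face at a large view angle, the rule falls back to the other camera, which matches the behavior reported for view angles beyond 15°.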
The recognition rates of the SRCN-DFER at different distances are shown in Table 4, which are still excellent.
Furthermore, the comparisons among DCNN [16], the proposed SRCN-DFER, and ResNet50 [25] are presented in Table 5. The architecture of ResNet50 includes 49 conv-pooling layers and a last fully connected layer for the classification. The important observations of Table 5 are discussed as follows. (i) The proposed SRCN-DFER is 5.7% and 6% better than DCNN and ResNet50, respectively. (ii) The experimental video for ResNet50 is at the URL https://youtu.be/tWigt50F_7M; its result is acceptable. (iii) The computation time of SRCN-DFER is on average 0.015s larger than that of DCNN but 0.02s smaller than that of ResNet50. The main reason is that 10 LSTMs with the same weights are required for the proposed approach. (iv) The recognition results of "Disgusted" and "Fearful" are improved by 19% [16] and 19.5% [25], respectively. (v) Two different persons, who have pose variations and are not in the training dataset, with slightly different backgrounds, are employed to further confirm the effectiveness of the proposed SRCN-DFER. The average recognition rate of 97.6% is still excellent, cf. URL: https://youtu.be/Kz3fC0tjLLE. (vi) The number of learning weights in SRCN-DFER is only 4% and 11.3% of that in DCNN and ResNet50, respectively, such that the overfitting problem can be reduced.
In Table 6, LRCN [29] and 3D-CNN [30] directly use many sequential images for training; in contrast, SRCN-DFER is pre-trained by static images of 3 datasets: NTUST-IRL, KDEF, and JAFFE. Then, the pre-trained weights of the CNN and the small random weights for the stack of LSTMs, fully connected, and softmax layers are set as the initial weights to train the SRCN-DFER by sequences of dynamic images from the CK+ dataset. Table 6 reveals that the SRCN-DFER, through "Transfer Learning", obtains a better performance by extracting useful information from data in a related domain and transferring it to the target task [44].

B. WIRELESS-BASED SPEECH COMMAND RECOGNITION
At the outset, the confusion matrix of SRCN-WSCR trained without noisy data, evaluated on the test data from the Google Speech Commands Dataset v0.02, is shown in Table 7, which is excellent. To verify its robustness, its confusion matrices using the original test data combined with metal and chopstick crashing noise and with hand clapping noise are respectively presented in Table 8 and Table 9, which are satisfactory. To boost the robust recognition of speech commands, six noisy data files from the Google Speech Commands Dataset are added to train the SRCN-WSCR. Then, the confusion matrices of the SRCN-WSCR with noisy training data [25] for the cases of Table 8 and Table 9 are respectively presented in Table 10 and Table 11, which are much improved. It confirms the superiority of the proposed SRCN-WSCR.
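For readers who wish to summarize confusion matrices such as those in Tables 7 to 11, the sketch below computes per-class recall and overall accuracy. It assumes that rows are true labels and columns are predictions, which may differ from the layout used in the tables; the 3 × 3 matrix `toy` is a hypothetical example and its counts are not taken from the paper.

```python
def per_class_recall(confusion):
    """Recall of each class; rows are true labels, columns are predictions."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Hypothetical 3-command example; these counts are NOT from Tables 7-11.
toy = [[48, 1, 1],
       [2, 47, 1],
       [0, 3, 47]]
recalls = per_class_recall(toy)
accuracy = overall_accuracy(toy)
```

Comparing the per-class values before and after adding noisy training data shows which commands benefit most, which a single average can hide.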

C. HUMAN-ROBOT COLLABORATION
The resolution and sampling rate for this study are (3840 × 1080) and 30 FPS, respectively. In addition, the human-robot collaborations in Table 12 are exemplified in the following 6 scenarios. (i) In the beginning, ODSR and the human are at (x_1, y_1, 180°) and (x_2, y_2, 0°), respectively, where positions are in meters. Since the ODSR in Region 1 does not detect a human over 10s, based on the searching strategy in Fig. 2, ODSR rotates 90° in the clockwise (CW) orientation by the ASFTSC in [19]. (ii) In Region 2, a human is detected, and then ODSR is controlled to (x_2, y_2 − 2.5, 90°). (iii) No face is detected over 10s by Faster R-CNN on ODSR. Then ODSR broadcasts "Where is your face orientation?" (iv) The human answers "Right", which indicates that the human face is oriented toward the right-hand side of ODSR. After it is recognized by SRCN-WSCR, ODSR is controlled to (x_2 + 2.5, y_2, 180°) in alignment with the human face. (v) Likewise, the speech command "Left" is recognized by SRCN-WSCR, and ODSR is then controlled to (x_2 − 2.5, y_2, 0°) in alignment with the human face. (vi) If the speech command "Forward" is recognized by SRCN-WSCR, ODSR passes through the waypoint (x_2 + 2.5, y_2, 90°) and is then controlled to (x_2, y_2 − 2.5, −90°) in alignment with the human face.
The operations in Table 12 do not discuss the "Backward" command, since in that state ODSR is directly in front of the human at a distance of about 2.5m and can already detect a face. The assigned distance of 2.5m is due to the environment constraint. Furthermore, the speech command "Stop" stops the motion of ODSR in any circumstance. After the facial emotion is recognized by SRCN-DFER, ODSR broadcasts "Are you (recognized emotion)?" Finally, the human answers with the "Yes" or "No" speech command via wireless transmission. If "Yes", the corresponding music is played. Otherwise, the recognition by SRCN-DFER continues.
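Scenarios (iv) and (v) above both place the ODSR 2.5m in front of the human face while looking back at it. A minimal sketch of that geometric rule follows; it is inferred from these two scenarios only and is not the authors' planner. The mapping of "Right" and "Left" to face directions of 0° and 180° assumes the ODSR heading of 90° reached in scenario (ii), and the human position used in the example is hypothetical.

```python
import math

STANDOFF_M = 2.5  # distance kept between ODSR and the human, from the text

# Face direction implied by each answer when the ODSR heading is 90 degrees
# (assumption derived from scenarios (iv) and (v) only).
FACE_DIR_DEG = {"Right": 0.0, "Left": 180.0}


def wrap_deg(angle: float) -> float:
    """Wrap an angle to the interval (-180, 180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped


def target_pose(human_xy, face_dir_deg: float):
    """Pose placing the robot STANDOFF_M in front of the face, looking back at it."""
    x, y = human_xy
    rad = math.radians(face_dir_deg)
    tx = x + STANDOFF_M * math.cos(rad)
    ty = y + STANDOFF_M * math.sin(rad)
    return round(tx, 6), round(ty, 6), wrap_deg(face_dir_deg + 180.0)


# Hypothetical human position (x_2, y_2) = (4.0, 2.0) in meters.
pose_right = target_pose((4.0, 2.0), FACE_DIR_DEG["Right"])
pose_left = target_pose((4.0, 2.0), FACE_DIR_DEG["Left"])
```

For this example the rule reproduces the targets (x_2 + 2.5, y_2, 180°) and (x_2 − 2.5, y_2, 0°) stated in scenarios (iv) and (v).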
The proposed approach is different from the relative pose estimation between two robots using an optimal Kalman filter [45] since it must have an operation with limited FOV.
The experimental video for human-robot collaboration is available at https://www.youtube.com/watch?v=J3DF30TdlzE. One representative human-robot collaboration with the "Happy" face emotion and its motion control response are presented in Table 13 and Fig. 10, respectively. They are explained in the following 13 portions. (i) To begin with, the wireless speech command "Follow" is received by ODSR. (ii) Based on SRCN-WSCR, the planned human-robot collaboration executes. Since the FOV of ODSR is in Region 1, no human is detected. (iii) Based on the searching strategy in Fig. 2, ODSR turns 90° in the clockwise (CW) orientation to Region 2 for the continuous face detection (see the 3rd subplot of Fig. 10(a)). (iv) A human in Region 2 is detected by Faster R-CNN on ODSR. (v) ODSR is controlled to keep 2.5m between them (see the 1st and 2nd subplots of Fig. 10(a) at t = 57s or Fig. 10(b)). (vi) Simultaneously, face detection (FD) using the Haar Cascade descriptor is implemented. Since no face is detected, ODSR asks "Where is your face orientation?" (vii) The human answers "Left" to ODSR. (viii) After the use of SRCN-WSCR, ODSR moves to the left side of the human and turns 90° in the CW orientation to detect a face (see the 3rd subplot of Fig. 10(a)). (ix) After a face is detected by the Haar Cascade feature descriptor, ODSR applies SRCN-DFER to recognize the human's face emotion. (x) ODSR broadcasts "Are you happy?" (xi) The human answers "Yes" to ODSR. (xii) The corresponding music reflecting the "Happy" emotion is played. (xiii) In this experiment, the camera axis is the same as the motion axis of ODSR, i.e., the Y-axis. The control response achieved by the adaptive stratified finite-time saturated control [19] is shown in Fig. 10(c). The simultaneous translation and rotation of ODSR is better than that of a differential mobile robot [39] or a car-like mobile robot [46].
[Figure caption fragment: ... Table 13 by the ASFTSC in [19].]

VI. CONCLUSION
A creative design of SRCN for the dynamic mapping of many machine learning problems, e.g., dynamic face emotion recognition and wireless speech command recognition, is established. At the outset, the CNN with fully connected and softmax layers is trained by static images to obtain the corresponding feature vector of facial emotion. Subsequently, SRCN-DFER with a stack of 10 LSTMs using shared weights is trained by 280 batches of dynamic face emotion images. It is similar to the few-shot concept and achieves an average 98% recognition rate for different persons with pose variation and slightly different backgrounds. The performance is superior to many previous studies for dynamic face emotion recognition [11], [12], [13], [20], [21], [22], [23], [24], [25], [26], [27], [28]. Furthermore, the comparisons among DCNN [16], ResNet50 [25], LRCN [29], and 3D-CNN [30] confirm the state-of-the-art performance. Since the speech command files are sufficiently numerous and dynamic, a pre-trained CNN for SRCN-WSCR is not required. In contrast, its 101 LSTMs outnumber the 10 LSTMs in the SRCN-DFER due to the strong dynamics of speech commands. The proposed approaches not only effectively tackle the recognition of the dynamic mapping of facial emotion and speech command but also prevent the overfitting problem in a noisy environment. Finally, the implementation of HRC, e.g., Table 13 and Fig. 10, is accomplished by the integration of trajectory tracking control of ODSR, searching and detection of human and face, preprocessing of speech command, dynamic face emotion and wireless speech command recognition, and music playing. In the future, multiple ODSRs and humans, as well as a distributive UWB network for wireless navigation, will be addressed.