Real-Time Implementation of Face Recognition and Emotion Recognition in a Humanoid Robot Using a Convolutional Neural Network

Robots can mimic humans, including recognizing faces and emotions. However, relevant studies have not been implemented in real-time humanoid robot systems. In addition, face and emotion recognition have been considered separate problems. This study proposes a combination of face and emotion recognition for real-time application in a humanoid robot. Specifically, face and emotion recognition systems are developed simultaneously using convolutional neural network architectures. The model is compared to well-known architectures, such as AlexNet and VGG16, to determine which is better for implementation in humanoid robots. Data used for face recognition are primary data taken from 30 electrical engineering students after preprocessing, resulting in 18,900 data points. Emotion data of surprise, anger, neutral, smile, and sad are taken from the same respondents and combined with secondary data for a total of 5,000 data points for training and testing. The test is carried out in real time on a humanoid robot using the two architectures. The face and emotion recognition accuracy is 85% and 64%, respectively, using the AlexNet model. VGG16 yields recognition accuracies of 100% and 73%, respectively. The proposed model architecture shows 87% and 67% accuracies for face recognition and emotion recognition, respectively. Thus, VGG16 performs better in recognizing faces as well as emotions, and it can be implemented in humanoid robots. This study also provides a method for measuring the distance between the recognized object and robot with an average error rate of 2.52%.


I. INTRODUCTION
Recently, the rapid growth of technology has advanced research in the field of robotics, including research on humanoid robots. A humanoid robot is a human-shaped robot equipped with a body, hands, head, and so on. Usually, a humanoid robot has the capability to interact with humans, such as recognizing the human and responding to commands given by the human.
The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.
Humanoid robots are usually expected to be socially assistive robots. Thus, face recognition is an important matter in human-machine interaction. Robots capture a human's face through a camera embedded as the eyes.
Face recognition is a technology with the capability of identifying or verifying the subject's identity in the form of images or video [1]. Several technologies have been developed to recognize faces. A.E. Omer and A. Khuran implemented facial recognition using principal component analysis (PCA) [2]. Then, Ebeid [3] compared two methods, a multilayer perceptron and a radial basis function with eigenface feature extraction, for face recognition. Sanjaya et al. [4] developed a social robot that can recognize and track the human face. In their study, they used the cascade classification method (Viola-Jones method) and a local binary pattern histogram (LBPH). However, only a few samples were used, and illumination conditions were not considered. Cilmi and Mercimek [5] also used the Haar cascade classifier to detect faces and the Kanade-Lucas-Tomasi feature tracker by considering neck movement. Zhao and Wei [6] proposed an LBPH algorithm based on neighborhood gray median (MLBPH) to improve LBPH in terms of illumination, expression, and attitude deflection. In another study, Zhi and Liu [7] used PCA to extract the features in grayscale images, a genetic algorithm to optimize the network weights of face features, and a support vector machine (SVM) as a classifier [8]. Borkar and Kuwelkar [8] also utilized PCA in combination with linear discriminant analysis (LDA) to reduce dimensionality. This method was implemented on the AT&T database. Fontaine et al. [9] modified the robust sparse coding algorithm to recognize faces in the Labeled Faces in the Wild (LFW) database.
These methods depend on the face data features. Thus, feature extraction is crucial in determining the success of recognition. However, feature extraction may reduce the dimensionality of the original dataset, which may remove some important information. Thus, in recent studies, deep learning algorithms, such as convolutional neural networks (CNNs) [10] and deep CNNs [11], have been used to improve face recognition accuracy.
In addition to face recognition, emotion recognition can also be considered a major ability of machines in human-machine communication [12], [13]. Emotion recognition can be performed based on speech and facial expressions [14] as well as text [15]. For human-robot interaction, facial expressions are very important since they can carry various pieces of information [16]. Nicolai and Choi [17] discussed facial emotion recognition in the context of a fuzzy system. Adeyanju et al. [18] evaluated the performance of different support vector machine kernels for facial emotion recognition. Ahmed et al. [19] performed facial emotion recognition using a CNN and data augmentation by combining various datasets. Ruiz-Garcia et al. [20] combined a CNN and SVM to recognize emotions using the KDEF dataset. In their study, Faria et al. [21] used a geometrical feature based on log-covariance and angles formed by facial landmarks. Other studies considered static images [22], [23] or implemented deep learning to support emotion recognition in humanoid robots. For example, Mehendale [24] used a CNN, Li et al. [16] utilized a combination of a CNN and long short-term memory (LSTM), and the authors of [25] proposed a conditional generative adversarial network. The authors of [26] proposed a method for recognizing facial expressions using a deep convolutional neural network that takes features from the local gravitational force descriptor as input. Meanwhile, the authors of [27] used a CNN, namely FER-net, to distinguish facial expressions and tested it on five datasets.
These methods perform quite well in face recognition and facial emotion recognition tasks but are still limited because face recognition and emotion recognition have not been combined into a single recognition system and were instead treated as separate problems. However, face recognition and emotion recognition should be implemented as a unit to improve the capacity for human-robot interaction. In addition, the importance of the position of a human as an object to interact with was not considered in these studies. Furthermore, only a few studies have implemented face recognition or emotion recognition in real time [5], [16], [21], [28], [29]. Thus, this study aims to treat face recognition and emotion recognition as a unit so that robots can interact with humans by recognizing their names and emotions in real time as well as their position. Different from other studies that utilized a previously collected dataset, this study used primary data obtained from male and female students, where some students wore glasses and some female students wore a hijab. The contributions of this study are as follows:
a. Face recognition and emotion recognition are combined into one unit, and the recognition system is embedded in the robot so that it can interact with a human based on his or her face and emotions in real time.
b. The performances of well-known CNN architectures, i.e., VGG16 and AlexNet, are compared with that of the proposed modified architecture.
c. A method is proposed for measuring the distance between the object's face and the position of the robot so that the robot can determine where the human is.
This paper is structured as follows. Section II provides the method used in this study. The results and discussion are presented in Section III. Finally, this paper is concluded in Section IV.

A. HARDWARE DESIGN
In this study, several pieces of hardware are used to support the implementation of a humanoid robot:
1. Webcam
2. JX Servo 60KG
3. Arduino
4. Raspberry Pi
5. Dot matrix
The positions of the components used in this study are shown in Figure 1 (A-D).
The face images are captured using a webcam embedded as the robot's eyes. A dot matrix is used to present characters such as lines and circles, which represent the form of the eyes. The JX Servo functions as the neck of the robot so that it can move to follow the position of the human's face after detecting and recognizing it.

B. FACE AND EMOTION RECOGNITION ALGORITHM SYSTEM DESIGN
The design of the recognition system embedded in the robot for recognizing a person's face is shown in Figure 2. As shown in Figure 2, the first stage is camera detection of the face to obtain the face images as a dataset for training. The CNN architectures used in this study are AlexNet [30] and VGG16 [31]. These architectures are chosen because they have shown good performance in face recognition [32] and emotion recognition [33]. The architectures of AlexNet and VGG16 are shown in Figure 3.
In addition, the AlexNet architecture is modified by changing some parameters, as shown in Table 1.
In addition to face and emotion recognition, the distance is also measured to detect the position of the object. Specifically, the x and y coordinates between faces/emotions and the humanoid robot are measured. When recognizing faces and emotions, the frame is represented by 4 variables (x, y, w, and h), where x and y are the bounding-box coordinates (x for the upper-left side and y for the lower-right side) and w and h are the width and height of the bounding box. These variables are processed to obtain the end x and end y values, which determine the values of kdX and kdY. The calculation of coordinates x and y is as follows: where kdX is the x coordinate and kdY is the y coordinate, StartX is the starting point of the X-axis in the bounding box, StartY is the starting point of the Y-axis in the bounding box, EndX is the end point of the X-axis in the bounding box, and EndY is the end point of the Y-axis in the bounding box. Additionally, the distance from the recognized face to the camera embedded in the robot is calculated as follows: where f is the focal length, w is the width in pixels, d is the distance in cm, and W is the width in cm.
The values of the x and y coordinates as well as distance are essential when the humanoid robot interacts with the human.
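As a minimal sketch of the quantities described above, the following Python snippet assumes that kdX and kdY are the midpoint between the start and end points of the bounding box, and that the distance follows the usual pinhole-camera similarity relation d = (W × f) / w, with f calibrated once from a reference image taken at a known distance. Both formulas are assumptions consistent with the variable definitions in the text rather than the exact expressions used in this study; the function names and all numeric values are hypothetical.

```python
def bounding_box_center(x, y, w, h):
    """Return (kdX, kdY) for a bounding box given as (x, y, w, h).

    ASSUMPTION: (x, y) is the start point (StartX, StartY) of the box, and
    kdX, kdY are the midpoint between the start and end points.
    """
    start_x, start_y = x, y
    end_x, end_y = x + w, y + h
    kd_x = (start_x + end_x) / 2
    kd_y = (start_y + end_y) / 2
    return kd_x, kd_y


def calibrate_focal_length(width_px, known_distance_cm, real_width_cm):
    """Estimate the focal length f from one reference image: f = (w * d) / W."""
    if real_width_cm <= 0:
        raise ValueError("real_width_cm must be positive")
    return (width_px * known_distance_cm) / real_width_cm


def estimate_distance_cm(focal_length, real_width_cm, width_px):
    """Estimate the distance d from the apparent width: d = (W * f) / w."""
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    return (real_width_cm * focal_length) / width_px


if __name__ == "__main__":
    # Hypothetical values for illustration only (not taken from the study).
    kd_x, kd_y = bounding_box_center(x=200, y=120, w=160, h=160)
    f = calibrate_focal_length(width_px=160, known_distance_cm=50, real_width_cm=15)
    d = estimate_distance_cm(f, real_width_cm=15, width_px=80)
    print(kd_x, kd_y, round(d, 2))
```

Under these assumptions, a face that appears half as wide in pixels as in the calibration image is estimated to be twice as far away.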

C. SYSTEM EVALUATION
Evaluation is performed to determine the performance of the developed face and expression recognition system. Performance includes accuracy measured as the recognition rate of the proposed system for faces and emotions in real time. The accuracy of the test is calculated using formula (1):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. This formula is used to obtain face recognition or emotion recognition accuracy. A calculation of the accuracy value shows the level of effectiveness per class of a classification.
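The per-class accuracy described above can be sketched in Python as follows. This is an illustrative sketch, assuming a one-vs-rest count of true/false positives and negatives for each class; the helper names and the example labels are hypothetical and are not taken from the study's test data.

```python
def one_vs_rest_counts(y_true, y_pred, positive_class):
    """Count TP, TN, FP, FN for one class treated as 'positive'."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive_class and pred == positive_class:
            tp += 1
        elif truth != positive_class and pred != positive_class:
            tn += 1
        elif truth != positive_class and pred == positive_class:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    y_true = ["smile", "anger", "sad", "smile"]
    y_pred = ["smile", "sad", "sad", "smile"]
    for emotion in ("smile", "anger", "sad"):
        print(emotion, accuracy(*one_vs_rest_counts(y_true, y_pred, emotion)))
```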

A. IMPLEMENTATION OF THE HUMANOID ROBOT
The humanoid robot implementation includes the mechanical design and the overall wiring of the components used. The dot matrix displays the visual appearance of the robot's eyes, the JX Servo moves the robot's head when looking for the facial position of a person to be recognized, and a camera module connected to a laptop is used to perform face and emotion recognition. The specifics of the humanoid robot design are shown in Figure 4(a-c).

B. DATASET COLLECTION
In this study, facial datasets were obtained from 30 Universitas Sriwijaya students, consisting of 21 male students and 9 female students. All the students gave their permission to use their face data. An example of face data is provided in  . The data were obtained using a webcam with a resolution of 640 × 480 pixels. The data were then processed and extracted to perform face recognition and emotion recognition on the images. The primary dataset includes 50 data points per class. Furthermore, the data were processed and extracted to produce 18,900 data points, with each class totaling 630 data points. Of these, 80% (15,120 data points) are used as training data and 20% (3,780 data points) are used as test data. Emotion data were also taken from the same respondents. To add variation to the data, a dataset from Kaggle [35] is also used in this study. Examples of emotions from Kaggle can be seen in Figure 6. The facial emotions used in this study consist of 5 expressions, namely, smile, anger, surprise, neutral and sad. A total of 4,000 training data points and 1,000 test data points are used.
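The 80/20 division of the face dataset described above can be illustrated with a short Python sketch. It assumes a per-class (stratified) split, which the text does not state explicitly; the function name, the seed, and the placeholder identifiers standing in for image files are hypothetical.

```python
import random


def stratified_split(samples_by_class, train_fraction=0.8, seed=0):
    """Split each class separately so every class keeps the same train/test ratio.

    ASSUMPTION: the split is stratified per class; the study only reports the
    overall 80%/20% proportions.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_fraction))
        train += [(sample, label) for sample in shuffled[:cut]]
        test += [(sample, label) for sample in shuffled[cut:]]
    return train, test


if __name__ == "__main__":
    # 30 classes with 630 placeholder samples each, i.e. 18,900 in total.
    data = {
        f"student_{i:02d}": [f"img_{i:02d}_{j:03d}" for j in range(630)]
        for i in range(1, 31)
    }
    train, test = stratified_split(data)
    print(len(train), len(test))
```

With 630 samples per class, this gives 504 training and 126 test samples per class, which matches the reported totals of 15,120 and 3,780.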
The collected training data and test data are then preprocessed. This preprocessing includes cropping each image to remove any background or regions other than the face.
After the cropping process, each image is resized from 640 × 480 pixels to 120 × 120 pixels. This is done to reduce the size of the dataset used during training. Image augmentation is then applied to increase the variation of the images, with range = 0.1, shear_range = 0.9, width_shift_range = 1 and height_shift_range = 0.01.
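The cropping and resizing steps can be sketched without external dependencies as shown below. This is only an illustration of the idea: it represents an image as a nested list of pixels and uses nearest-neighbour sampling, whereas an image library would normally be used in practice. The interpolation method, the function names, and the crop rectangle are assumptions and are not taken from the study.

```python
def crop(image, x, y, w, h):
    """Keep only the face region; image is a list of rows of pixels."""
    return [row[x:x + w] for row in image[y:y + h]]


def resize_nearest(image, out_w, out_h):
    """Resize with nearest-neighbour sampling (ASSUMED interpolation)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


if __name__ == "__main__":
    # Synthetic 640 x 480 frame whose pixels store their own (row, col) position.
    frame = [[(r, c) for c in range(640)] for r in range(480)]
    # Hypothetical face bounding box inside the frame.
    face = crop(frame, x=200, y=120, w=240, h=240)
    small = resize_nearest(face, out_w=120, out_h=120)
    print(len(small), len(small[0]), small[0][0])
```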

C. FACE RECOGNITION
In this study, the proposed architecture model (model C) was compared with AlexNet (model A) and VGG16 (model B) using 500 epochs. The training loss for each class can be seen in Figure 7. As shown in Figure 7, the VGG16 architecture has a lower training loss than both AlexNet and the proposed model. The training losses of VGG16, AlexNet, and the proposed architecture are 0.011, 0.151 and 0.052, respectively. These results indicate that the VGG16 architecture can provide better performance than AlexNet and the proposed model for face recognition. Nevertheless, the average loss for model C, 0.052, is close to that of model B. The proposed model architecture (model C), which is a modification of AlexNet, also has a smaller loss than AlexNet. In addition, model C has fewer parameters, so its training time is shorter than that of model B. Table 2 shows the accuracy obtained with the test data. The model using the VGG16 architecture performed much better than AlexNet and our model. The VGG16 architecture recognizes all test data samples, and our model can recognize the face with an accuracy of 95%. These results indicate that the VGG16 model and our model can recognize faces well. Additionally, AlexNet recognizes 86% of the test data. Errors occurred when recognizing samples 2, 8, 10, and 18. The cause of the errors is the similarity of the facial features. For example, male student 8 has a similar face to male student 29, as shown in Figure 8.

D. EMOTION RECOGNITION
The training loss results for emotion recognition are shown in Figure 9. The training losses for AlexNet, VGG16 and our model architecture were 0.352, 0.1875 and 0.301, respectively. These results show that the VGG16 architecture is superior to AlexNet and our model. This may be due to the parameters used in the proposed architecture and the number of layers contained in each architecture. Table 3 shows the accuracy obtained from the test data. As shown in the table, the emotion recognition accuracy is lower than that of face recognition. This may be caused by the similarity between the emotional expressions of each individual. As shown in Figure 9, the emotion of anger is similar to surprise. The best accuracy, 82%, is obtained by model B (VGG16). The accuracy of the proposed model (model C) is 71%, which is higher than that of model A (AlexNet) and lower than that of model B. Even though its accuracy is lower than that of model B, the proposed model has fewer layers, so its training process is faster than that of model B. These results indicate that the VGG16 model and our model can recognize emotions well.

E. REAL-TIME TESTING ON THE HUMANOID ROBOT
Real-time testing was carried out directly with the camera module attached to the humanoid robot. The test was conducted in a room under three illumination conditions: dark, dim, and bright. It aimed to determine whether the humanoid robot can detect and recognize the faces and emotions of people around it, using as input the face and emotion detection results from the VGG16 model (model B) and the proposed model (model C), which were saved during the training process with 500 training epochs.
The recognized data, coordinates, and distance are sent using the Robot Operating System (ROS). These data are needed later for the voice recognition and movement systems of the robot.
The system testing results using the VGG16 model (model B) with 500 epochs can be seen in Table 4.
In tests 1 to 10, the movement of the humanoid robot in looking for faces and then detecting and recognizing faces is still not stable because the camera position must continuously move to follow the face position of the person to be recognized. However, the humanoid robot can still detect and recognize the person in front of it. Face recognition using model B is good because the system can recognize the faces of people in front of it, but there are still errors in recognizing emotions. In the 2nd, 5th and 9th tests, there are still system errors in detecting and recognizing emotions. This is due to several factors, such as poor or dim lighting conditions, the distance between the robot and the person to be recognized being quite far, the face not being positioned toward the camera and the lack of variation in training data from various possible conditions that exist during real-time testing on the humanoid robot. Additionally, in model B, the training process time is quite long due to the large number of layers in model B. The percentage of successes in recognizing faces using model B for all 30 students was 100% and that for emotion recognition was 73%.
The table also shows that the system can measure the distance between the recognized object and the robot well, with an average error rate of 2.52%. This error may be caused by the object changing position, which reduces the distance measurement accuracy. The results of system testing using the proposed model (model C) with 500 epochs can be seen in Table 5.
Similar to testing using the VGG16 model (model B), in tests 1 to 11, the movement of the humanoid robot in finding faces and then detecting and recognizing faces is still not stable because the camera position must continuously move to follow the position of the face of the person to be recognized. Even so, the humanoid robot can still detect and recognize the person in front of it. In the 2nd, 3rd, 7th, 8th, and 10th tests, there are still system errors in detecting and recognizing emotions, and in the 3rd, 5th and 6th tests, there are still system errors in detecting and recognizing faces. This can be due to several factors, such as poor or dim lighting conditions, the distance between the robot and the person to be recognized being quite far, and a lack of variation in training data from various possible conditions that exist during real-time testing on humanoid robots. Overall, the success percentages for recognizing 30 students' faces and emotions using model C were 87% and 67%, respectively.
The real-time test results obtained using model B and model C show that face and emotion recognition was successfully implemented in a CNN-based humanoid robot with both architectures, using the point coordinates of the detected and recognized face as system input to move the servo and dot matrix. Model B achieved 100% accuracy for face recognition and 73% for emotion recognition. The proposed model (model C) achieved 87% and 67% for face and emotion recognition, respectively. Although model B performs better in this study, its training time is much longer than that of model C due to the number of layers used. In addition, the movement of the servo can affect the accuracy of real-time recognition, and lighting is also very important during real-time recognition.
The real-time experiments shown in Tables 4 and 5 also show that this study can calculate the distance between the robot and the object well. The distance and illumination may influence the accuracy of recognizing faces and emotions. The VGG16 model and the proposed model are quite good at recognizing faces in bright, dim or dark rooms. However, emotions are rather difficult to recognize, especially in dark and dim rooms. Additionally, a greater distance between the object and the robot may cause difficulty in recognizing faces and emotions.
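An average distance error rate such as the one reported for the real-time tests can be computed along the following lines. This sketch assumes that the error rate is the mean absolute percentage error between the measured and actual distances, which is one common definition and may differ from the exact one used in this study; the function name and the distance values are hypothetical.

```python
def mean_percentage_error(measured_cm, actual_cm):
    """Mean of |measured - actual| / actual * 100 over all trials.

    ASSUMPTION: 'average error rate' means mean absolute percentage error.
    """
    if not measured_cm or len(measured_cm) != len(actual_cm):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [
        abs(measured - actual) / actual * 100
        for measured, actual in zip(measured_cm, actual_cm)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical distances in cm, for illustration only.
    actual = [50, 100, 150]
    measured = [51, 98, 153]
    print(round(mean_percentage_error(measured, actual), 2))
```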

IV. CONCLUSION
Based on this research, it can be concluded that the VGG16 model (model B) is superior to model C and model A, as shown by the success rates in detecting and recognizing faces and emotions of 100% and 73%, respectively. The smallest average losses for face and emotion recognition, 0.011 and 0.1875, respectively, are obtained with VGG16. However, the training process of model B is much longer than that of models A and C because it uses many more layers. The face and emotion recognition results obtained with model C are not much different from those obtained with model A and model B. Model C has the advantage of a faster training process because it has fewer parameters.
Models B and C were then used as inputs to the humanoid robot's face and emotion recognition system. This study revealed that the implementation and development of a CNN with the VGG16 architecture (model B) and a modified AlexNet model (model C) for recognizing faces and emotions as input to a humanoid robot was successful. This was shown by the accuracy values obtained, with success percentages of 100% and 73% for face and emotion recognition, respectively, when applying the VGG16 architecture. In model C, the corresponding accuracies were 87% and 67%. In addition, the face and emotion recognition processes carried out in this study were combined into one recognition framework.
This study also showed that recognition can be implemented in real time for humanoid robots and that the distance between the object and the humanoid robot can be measured well. The distance and illumination are also important factors in recognition. Thus, future studies should increase the size of the training dataset while considering illumination. In addition, the proposed system still needs to be refined and improved in future work.