Facial Emotion Recognition for Mobile Devices: A Practical Review

Communicating via email or various chat applications on smartphones is part of most people's daily lives. But in written form, human communication loses a lot of valuable information, such as the facial expressions and emotions of the person you are communicating with. Thanks to techniques from the field of image processing, it is now possible to capture these non-verbal phenomena and supplement written input with their non-verbal characteristics. In this paper, we explore the possibilities of emotion recognition from front-camera images on mobile and embedded devices. A total of 63 classification and 28 regression models, based on twelve different neural network architectures optimized for low-performance mobile devices, were trained and evaluated for success rate and latency. Each neural network model is trained and evaluated within the Keras API of the TensorFlow library and then converted to the TensorFlow Lite standard to reduce memory and computational requirements. Great care is taken to ensure that the entire process, from face detection to emotion classification, can operate in real time. To demonstrate and compare the performance of the evaluated models, a freely available optimized application running on Android mobile devices was created and published on Google Play; its source code is also available.


I. INTRODUCTION
Communication via short messages, email or various chat applications on smartphones is part of most people's daily lives. However, human communication loses a lot of valuable information in its written form, such as the facial expressions and emotions of the person you are communicating with.
The universality of facial expressions of emotion is one of the still-debated issues in the biological and social sciences [1], [2], [3], [4], [5], [6]. Darwin's universality hypothesis [7] states that all humans communicate the six basic emotional states (joy, surprise, fear, disgust, anger and sadness) through the same facial movements, based on their biological and evolutionary origins.
These non-verbal signals can accentuate the meaning of verbal messages (through accompanying gestures and grimaces) or complement them, but they can also completely change the meaning of what is being communicated (e.g., an ironic grin accompanying the comment ''You did it really well'' versus the same sentence spoken with genuine enthusiasm). Non-verbal signals also determine much of how we perceive explicit verbal messages: it has been demonstrated that decoding the meaning of a message varies with facial expression [8]. A now-classic study of non-verbal behavior by Mehrabian [9], and later studies with the contribution of Ferris [10], showed that attitude toward a stranger who said ''maybe'' was approximately 1.5 times more influenced by the stranger's facial expression than by the tone of their voice.
The question we can ask is whether the representation of emotions using emoji [11] in computer-mediated communication is an appropriate form of informing about the speaker's attitude. Studies [11], [12], [13] have shown that there is no indication that computer-mediated communication is less emotional or less personally engaging than face-to-face communication. On the contrary, emotional communication online and offline is surprisingly similar, and when differences are found, they unexpectedly point to more frequent and explicit communication of emotion in computer-mediated communication [14]. A study [15] aimed at evaluating emotional responses to facial emoji using physiological and self-assessment measures showed that participants' emotional experience was consistent with the emotions expressed by the emoji. No gender differences were found, and overall the results suggest that emoji are able to elicit particularly pleasant affective states. With some exaggeration, we can say that emoji are a modern form of the Facial Action Coding System [16].
Mobile and wearable devices can act as emotion sensors thanks to their integrated cameras and sensors for various physical variables. They have enough power to evaluate and interpret the sensed values as one of the emotional states [17], [18] and to encode it as an appropriate emoji icon. Mobile phone applications can take advantage of the combination of facial recognition from a device's camera and various built-in or Bluetooth-connected sensors measuring EEG, heart rate, respiration, body temperature and movement [19], [20], [21]. Facial emotion recognition and its presentation in the form of emoji can be used to observe the behavior of mobile phone game players [22] or of drivers in autonomous cars [23], [24], in enhanced chat applications [25], or to create emotion-aware mobile applications for autistic children [21], [26].
Recently published papers [27], [28], [29], [30] focused on facial expression and emotion recognition have presented their results on specialized mobile platforms (often single-board computers, typically the Raspberry Pi). Although the authors test their solutions on these embedded devices, we hardly ever see real tests on the most common type of mobile device, the mobile phone; published papers rarely contain results from practical deployment and operation on the most common types of phones.
The main purpose of this paper is to address this gap: to review and compare the most widely used methods for facial emotion recognition and to test these methods on several common mobile phones (and a Raspberry Pi for reference). The presented results evaluate dozens of different emotion recognition methods [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42] following the same methodology and compare them across several commonly available mobile phones. The results provide a comprehensive overview of the field, comparing neural network parameters, sizes and latencies on mobile devices.

II. NEURAL NETWORKS FOR MOBILE DEVICES
Simplicity of architecture is essential for the use of neural networks on mobile phones or microcomputers. Neither the amount of available memory nor the computational power can be expected to be comparable to desktop computers, and GPU acceleration is not the rule either. Over the years, many approaches have been devised, and most of the architectures below have the following features in common. Probably the most fundamental is the use of 1 × 1 convolution, introduced in [43]. It is itself relatively computationally inexpensive, but it makes the subsequent passage through other convolutional filters and feature extractors many times cheaper, with only a slight reduction in detection quality and accuracy. Another desirable feature is the ability to scale the number of trainable parameters by changing the number of layers of the network or its width; this way, the architecture can be adapted to the specific performance of the hardware. Typical activation functions are usually variants of ReLU [44], be it Leaky ReLU [45], ReLU6 [31] or hard swish [33], which are semi-linear and simple to compute (see the sketch below).
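As a concrete illustration, both functions are one-liners in TensorFlow (the library used throughout this paper); the hard-swish form follows the MobileNetV3 paper [33], and the test values are ours:

```python
import tensorflow as tf

def relu6(x):
    # ReLU capped at a maximum value of 6, as used in MobileNet [31].
    return tf.nn.relu6(x)

def hard_swish(x):
    # Piecewise-linear approximation of swish from MobileNetV3 [33]:
    # x * ReLU6(x + 3) / 6 avoids the cost of computing a sigmoid.
    return x * tf.nn.relu6(x + 3.0) / 6.0

print(hard_swish(tf.constant([-3.0, 0.0, 3.0])).numpy())  # [0. 0. 3.]
```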
The following architectures were selected for our review because they theoretically allow sufficient scalability for use on mobile devices or because they were designed specifically for such devices. They also represent the most well-known architectures; almost all are directly available in the TensorFlow Keras API. We want to achieve latency suitable for real-time use, so we set a latency threshold of 150 milliseconds for image processing by a particular model. Memory requirements do not play such a critical role, but the smaller the model, the better.

A. MOBILENET
MobileNet neural network models [31], [32], [33] are renowned solutions for mobile applications and have been designed for the highest efficiency of operations. MobileNet-type networks are successfully used in many applications for facial emotion detection [46], [47], [48]. They are convolutional neural networks, but the convolution is computed in an efficient way: the method, called depthwise separable convolution, splits the classical convolution into two steps, a depth-wise convolution followed by a point-wise convolution [31].
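To make the savings concrete, the following minimal Keras sketch compares the parameter counts of a standard convolution and its depthwise separable counterpart; the layer sizes are illustrative, not taken from any model in this paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Standard 3x3 convolution, 64 -> 128 channels:
# 3*3*64*128 weights + 128 biases = 73,856 parameters.
standard = tf.keras.Sequential([layers.Conv2D(128, 3, padding="same")])
standard.build((None, 48, 48, 64))

# Depthwise separable variant: 3*3*64 depthwise weights
# + 64*128 pointwise weights + 128 biases = 8,896 parameters.
separable = tf.keras.Sequential([layers.SeparableConv2D(128, 3, padding="same")])
separable.build((None, 48, 48, 64))

print(standard.count_params(), separable.count_params())
```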
In TensorFlow [49], a MobileNet neural network model can be built very easily through the Keras API by simply calling the appropriate method and specifying a few network parameters, such as the input image size, number of color channels, weights, or the number of object classes we would like to distinguish [50]. Another important parameter of the model is alpha, which can be used to control the number of trainable network parameters. Furthermore, we can choose between different versions of models from the MobileNet family: MobileNetV1 [31] (2017), MobileNetV2 [32] (2018), and MobileNetV3Small and MobileNetV3Large [33] (2019). Google has also released the MobileNetEdgeTPU model [51] (2019), optimized for special TPU computing units. In its mobile version, this chip architecture is used in the Google Pixel 4 series and newer phones for hardware acceleration of neural networks and AI runtimes.
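For illustration, a MobileNetV3Small classifier could be instantiated as follows; the eight classes correspond to the AffectNet categories used later in this paper, while the remaining values are illustrative defaults:

```python
import tensorflow as tf

# Illustrative configuration: 8 emotion classes, default width (alpha),
# training from scratch rather than from pretrained ImageNet weights.
model = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),
    alpha=1.0,
    minimalistic=False,
    weights=None,
    classes=8,
)
model.summary()
```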
MobileNetV2 is based on the principles of MobileNetV1, but additionally extends the architecture with inverted residual blocks, which allow combining results from the activation functions with results that have passed through the block's convolution filter [32].
The latest version, MobileNetV3 [33], claims up to half the latency of the previous generation while increasing classification accuracy by a few percent [50], measured on the Google Pixel 1 phone with ImageNet dataset weights. Compared to the second generation, it brings a number of improvements. It comes with a new block type, Squeeze-and-Excitation (SE), which better weights feature maps based on their channel dependencies. Also, instead of the ReLU activation function, the hard-swish function is used, which reduces the number of multiply-accumulate operations (MAC) while preserving nonlinearity [52]. It also comes in separate versions for more powerful and weaker target devices, and both versions can additionally be built in minimalistic or full form. With meaningful settings, a MobileNetV3 model can have 2 to 5.4 million trainable parameters.

B. NASNET
The abbreviation NAS stands for ''Neural Architecture Search'', a reference to the way the network in the original implementation adjusts its own architecture based on overall latency, success rate, and dataset size [34]. The way the network is built also changes the way it is trained: instead of looking for the overall most successful network, the focus is on the best adaptation of a small convolutional cell to the given problem. NASNet has found its way into driver emotion recognition [53], [54], among other applications.
NASNet was created in three variants, labeled A, B, and C. The resulting architecture is based on the results of optimizing classification accuracy on the ImageNet dataset [55]. For the same number of trainable parameters, the highest success rate on the CIFAR-10 [56] and ImageNet datasets is achieved by type A, which is also the variant implemented in TensorFlow. However, instead of making the architecture scalable through parameters, the TensorFlow developers decided to make the models available only in the smallest and largest configurations, called NASNetMobile and NASNetLarge. The mobile version has 5.3 million trainable parameters and the full version 88.9 million.

C. SQUEEZENET
Another mobile architecture was developed at a time when neural networks were getting deeper and the primary goal was to increase success rates. SqueezeNet [35] (2016) retained accuracy comparable to AlexNet [57] while using only about one-fiftieth of the parameters. With one and a quarter million trainable parameters, the authors report a model size of only 0.5 MB, on the order of one-hundredth that of AlexNet; training is also many times faster.
SqueezeNet applications for emotion classification focus mainly on other data sources, such as EEG [58], [59], but there are also applications that use facial images [60].
The innovation concerns in particular the so-called Fire module, which consists of a squeeze layer and an expand layer. In the squeeze layer, a small set of 1 × 1 convolutions is applied to the K input channels, reducing them to a much smaller number of channels. The expand layer then restores the desired number of output channels of the Fire module, using two parallel sets of 1 × 1 and 3 × 3 convolutions whose results are concatenated. Fire modules, in combination with residual connections between them, make it possible to extract even hard-to-find features from the image.
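Since SqueezeNet is not shipped with the Keras API, the following is a minimal sketch of the Fire module as described above; the squeeze and expand widths are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, squeeze_filters=16, expand_filters=64):
    # Squeeze layer: 1x1 convolutions reduce the channel count.
    s = layers.Conv2D(squeeze_filters, 1, activation="relu")(x)
    # Expand layer: parallel 1x1 and 3x3 convolutions, concatenated
    # back to the desired number of output channels.
    e1 = layers.Conv2D(expand_filters, 1, padding="same", activation="relu")(s)
    e3 = layers.Conv2D(expand_filters, 3, padding="same", activation="relu")(s)
    return layers.Concatenate()([e1, e3])

inputs = tf.keras.Input(shape=(48, 48, 96))
outputs = fire_module(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 48, 48, 128)
```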

D. EFFICIENTNET
EfficientNet [36] is a family of convolutional neural networks designed by a team from Google Brain (2019). The team set out to create a network architecture that is as efficient as possible in its operation, and that can easily adapt its width and depth to a given problem as required. EfficientNet combines key techniques of the MobileNet and SqueezeNet architectures to extract a high number of features of interest from images with a low number of parameters. EfficientNet is successfully used for emotion classification in a video sequence processing library designed for mobile devices [61], [62].
In addition to the original architecture, EfficientNet has also been released in a revised version, EfficientNetV2 [37] (2021), which implements so-called adaptive gradient clipping (AGC) instead of traditional batch normalization (BN). This technique should provide a much smoother convergence of success rates during training, while being less computationally intensive [37]. However, it cannot be generally claimed that AGC achieves better success rates than BN for the same number of training epochs; the results depend on the specific network.
Because this architecture is designed from the ground up to be highly scalable, the authors provided eight optimized configurations in the first generation. The smallest, denoted EfficientNetB0, has just over 5 million trainable parameters and a size of about 20 MB. The smallest second-generation EfficientNet model has about 2 million more parameters and is about 10 MB larger. The largest model, EfficientNetB7, has almost 67 million parameters and a size of 250 MB [36]. In the second generation, it is matched by EfficientNetV2L, which roughly doubles both the parameter count and the size.

E. SHUFFLENET
Researchers at the University of Hong Kong are behind another innovative neural network called ShuffleNet [38] (2017). The word shuffle in this case refers to one of the key layers of the network. In the residual block, after the depthwise convolution (over all channels of the tensor), the channels of the feature map are shuffled before the subsequent pointwise convolution: the channels are divided into several groups, the groups are swapped, and the result is reshaped back into channels of the original dimension. For example, in a feature tensor of dimension 56 × 56 × 24, the 24 channels are divided into two groups of 12 channels each, giving a tensor of dimension 56 × 56 × 2 × 12; the groups swap order and are reshaped and merged back to the original channel dimension (see the sketch below). By swapping the internal channel structure within the residual block, the network learns to perceive the relationships between the channel groups in the feature map, and the number of parameters can be reduced while maintaining the same classification success rate. At the same time, swapping channel groups is a relatively inexpensive operation, making a pass through the network less computationally intensive.
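A minimal sketch of the channel shuffle operation in TensorFlow, reproducing the example above; the function name is ours:

```python
import tensorflow as tf

def channel_shuffle(x, groups):
    # x: feature tensor of shape (batch, height, width, channels);
    # the channel count must be divisible by the number of groups.
    _, h, w, c = x.shape
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])  # swap group and channel axes
    return tf.reshape(x, [-1, h, w, c])

# The example from the text: 56 x 56 x 24 with two groups of 12.
t = tf.random.normal([1, 56, 56, 24])
print(channel_shuffle(t, groups=2).shape)  # (1, 56, 56, 24)
```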
To maximize the success of ShuffleNet detection, it is desirable to divide the channels into a higher number of groups. However, since the convolutions are applied only within the groups, this increases the overall network fragmentation and worsens parallelization (especially on GPUs). The authors of the original architecture therefore came up with a solution in which a new block type, called the shuffle block, is included before each channel split. It has two parallel branches; the input tensor is split between them and the branches are concatenated at the end. The first branch leaves the input unchanged, and the second applies 1 × 1, 3 × 3 and 1 × 1 convolutions to it in that order (the input dimension is preserved). In contrast to the grouped convolutions, these are applied to each channel separately, which makes the necessary computations easier and more efficient to parallelize, while the number of shuffle blocks needed to obtain the features can be reduced. This improved architecture has been named ShuffleNetV2 [39] (2018). MiniShuffleNetV2 also appears in a processing pipeline for real-time face detection and emotion recognition [63].

F. DENSENET
DenseNet [40] (2017) is a group of deep neural network architectures that differ primarily in their depth. The version with 121 layers is the shallowest and thus the most suitable for use on a mobile device; other variants have 169, 201 and 264 layers [40].
DenseNet is often compared to the conceptually similar ResNet [64] architecture, which deals with the problem of vanishing gradients by using residual blocks. However, ResNet will not be discussed in this paper: initial practical tests showed that, even in its smallest configuration with fifty layers (ResNet50), it is not able to compete with the others. Detection accuracy was only average, and latency, which is essential for mobile devices, was many times higher for ResNet than for the other networks. By these criteria, even the DenseNet121 network cannot be considered a fully mobile architecture. Due to its greater complexity, DenseNet is mainly used for emotion classification with less extensive inputs, e.g., EEG data [65], [66].

G. GHOSTNET
The authors of the GhostNet architecture [41] (2019) came up with an interesting finding based on their research into the neural network architectures of the time. They found that very often the same or very similar feature maps are generated from images multiple times, which they considered inefficient. They proposed a solution, the so-called ghost module, which splits the generation of feature maps into two parts. The first part is produced by standard convolutions, as in common architectures, but with fewer parameters. The second part creates additional, so-called ghost maps from these maps, in a process that could be likened to augmentation: using simple operators with linear computational complexity, the feature maps are transformed to produce more, similar maps. Subsequently, the feature maps produced by the convolution pass and the ghost maps are concatenated into a single tensor. In this way, the authors streamlined the generation of mutually similar feature maps and reduced the computational demands of the network while maintaining comparable success rates. The design of this network is also suitable for multimodal emotion recognition [67].
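A minimal sketch of the ghost module, under the assumption that a depthwise convolution serves as the cheap linear operator; the function name and widths are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def ghost_module(x, out_channels, ratio=2):
    # First part: an ordinary convolution producing only a fraction
    # of the desired output channels.
    primary = layers.Conv2D(out_channels // ratio, 1, activation="relu")(x)
    # Second part: a cheap linear operation (here a depthwise
    # convolution) generates the remaining "ghost" feature maps.
    ghosts = layers.DepthwiseConv2D(3, padding="same", activation="relu")(primary)
    return layers.Concatenate()([primary, ghosts])

inputs = tf.keras.Input(shape=(56, 56, 16))
outputs = ghost_module(inputs, 32)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 56, 56, 32)
```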

H. MNASNET
The MnasNet architecture [42] was created in a similar way to NASNet, but from the beginning it was designed primarily for use in mobile devices. It takes inspiration from many optimization techniques of other mobile architectures, in particular MobileNet. In designing this architecture, the authors drew on Google's insights into feedback-driven learning from the Google AutoML project, which aims to automate the design of machine learning models. In practice, this approach is more applicable to smaller networks and medium-sized datasets [68], but automated network designs perform well even compared to those designed by human experts. The network is also suitable for emotion recognition [69].
The particular form of MnasNet was created using AutoML by continually training and measuring success on automatically generated architectures. In addition to the success rate, the real latency on Pixel phones was also evaluated, and the resulting network design is therefore a compromise between the aforementioned properties [70]. According to the authors, the resulting architecture has one-third lower latency than MobileNetV2 and almost two-thirds lower latency than NASNet, with comparable success rates. While all of the previously mentioned networks are designed for use on portable and low-power devices, MnasNet is the only one designed primarily for phone hardware.

III. DATASETS
Choosing an appropriate dataset for training the network is an essential step in creating a successful neural network. The ideal dataset should contain data that is as diverse as possible while staying within the desired categories, and should reflect as closely as possible what the network will encounter during real deployment. Furthermore, it should be checked that the categorization of the data actually matches its real nature, so that the network is not unnecessarily confused by inconsistencies during training. Ideally, we are looking for a large dataset of photos that match the outputs of face detectors, i.e., face cutouts from the chin to the forehead and to the roots of the ears, with minimal background.

A. FER2013
FER2013 [71] was created based on Google search results and is one of the first emotion datasets ever, which is why it is still widely used for facial emotion recognition. It contains 35,887 black-and-white images of 48 × 48 pixels, each labeled with one of seven emotion categories (anger, disgust, fear, happiness, sadness, surprise, and neutral).
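For orientation, FER2013 is commonly distributed as a single CSV file with each image stored as a space-separated pixel string; a minimal loading sketch (the file path is an assumption):

```python
import numpy as np
import pandas as pd

# Assumed path to the distributed CSV; columns: emotion, pixels, Usage.
df = pd.read_csv("fer2013.csv")
labels = df["emotion"].to_numpy()  # integer category per image

# Decode each space-separated pixel string into a 48x48 grayscale image.
images = np.stack([
    np.fromiter((int(v) for v in row.split()), dtype=np.uint8, count=48 * 48).reshape(48, 48)
    for row in df["pixels"]
])
print(images.shape, labels.shape)
```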
The distribution of categories in the dataset is not very balanced: neutral and happy faces together account for almost half of the images, while disgust is captured in only 547 images. On closer examination, we find that the dataset also includes images that contain only a drawing of a face, or no face at all (see Figure 2). FER2013 is often criticized for the poor quality of its images and a significant number of annotation errors. Despite this, success rates on this dataset are still cited in publications as an additional reference.
In 2016, Microsoft released a revision called FER+ [72]. It partially removed the problematic images and completely redid the annotation. There are a total of 12 categories, created by combining pairs of the original ones (happy surprise, sad fear, angry disgust, and so on). This has made it possible to improve the distribution of categories across the dataset while maintaining the quality of the images.

B. RAF-DB
When creating this dataset, the authors focused on selecting photos with the greatest variety in many respects. A wide range of ages and races of the people captured, as well as of backgrounds, lighting, and photo quality, is used to approximate real-life conditions as closely as possible. The images were collected from various search engines and social networks and annotated by a group of forty people [73].
The dataset contains about thirty thousand images divided into two groups according to the annotation method. In the first group, the images are divided into seven categories, as in FER2013; the second group contains twelve categories of combined emotions, as in FER+. The dataset is freely available for non-commercial use on the publication's website and gathers dozens of citations each year.

C. SFEW, EXPW AND CK+
The extended Cohn-Kanade dataset (CK+) is a collection of 593 short video sequences of 123 people of different ages. Each sequence captures a facial change from neutral to one of seven other emotions, expressed both purposefully and spontaneously [74]. The representation of race and gender is very diverse, which was also the goal. The main difference from other datasets, however, is the very laboratory-like, artificial setting of the recordings: each subject always faces the camera in front of a white background. It is therefore more a data collection for the analysis of mimic muscles. The number of images that can realistically be used for the purposes of this paper is very small; on the other hand, their quality is high.
Similarly to RAF-DB, the Static Facial Expressions in the Wild (SFEW) and Expressions in the Wild (ExpW) datasets try to capture faces as realistically as possible. In both cases, however, the images contain large redundant areas without a face, so to train the neural networks, the faces would first need to be detected and cropped out at a uniform resolution; only after this modification could the datasets be used.

D. AFFECTNET
By far the largest and, at the time of writing, by far the most cited facial emotion dataset is AffectNet [75]. Three kinds of annotations are created for each image: membership in one of eight categories of emotion; two values describing the degree of arousal and valence of a person's emotion (see Russell's emotion diagram [76]); and an array of 68 coordinates of points of interest on the face. It contains roughly 440,000 human-annotated images, of which less than 300,000 are available for use; the other 550,000 were annotated by a ResNet neural network trained on the previous set of images, with a reported 65 % success rate in categorical detection.

Russell's emotion diagram [76] plots the intensity of the emotion (arousal) on the vertical axis and the positivity of the emotion (valence) on the horizontal axis. Moving from top to bottom on the Y-axis, we go from maximum arousal to complete calm, and on the X-axis negative experiences lie on the left and positive ones on the right. The basic idea is that although emotions are primarily divided into four quadrants, when their representative points are close together on the diagram, they share many key characteristics. This emotional model has become widely accepted in psychology and beyond.
The annotation was performed independently by two people with an overall agreement rate of 60.7 %, which is relatively low, but the categories of no face, uncertain expression, and no emotion are included in this figure. Since these three categories are not present in the resulting distributed dataset, the agreement rate after excluding them is almost 66 %. Annotator agreement expressed as the root mean square error is 0.340 for valence and 0.362 for arousal. In addition, note that the root mean square error does not take the sign into account (the numbers 3 and 5 are as far apart as -1 and 1, yet an error in the sign is not negligible in this case). These examples show how subjective human emotions are and that they cannot always be clearly categorized.
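For reference, the RMSE used throughout this paper is the standard definition, computed separately for valence and arousal over the N annotated images:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}
```

where \(\hat{y}_i\) is the predicted valence (or arousal) value and \(y_i\) the annotated one.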
The dataset is useful for training both the detection of facial contours in an image and the recognition of emotion, whether within fixed categorical boundaries or on an emotion diagram. The images are focused exclusively on the face region and have a constant resolution of 224 × 224. Due to the interplay of all the aforementioned aspects, AffectNet was selected as the default dataset for our evaluation and demonstration implementation.
The authors of a companion publication to the dataset also published reference success rates for categorical emotion detection and RMSE values for valence and arousal. For this, AlexNet was used with a weighted loss function heavily penalizing misidentification of the disgust, contempt, and fear classes, because these are the least represented in the dataset. In the categorization across the entire dataset, the highest success rate achieved was 58 %, and the RMSE values for valence and arousal were 0.37 and 0.41, respectively. These numbers roughly correspond to the agreement of the annotators.

IV. EVALUATION IMPLEMENTATION
To improve the ability of the neural network to generalize the features of interest, we perform augmentation. After each training epoch, we randomly apply some combination of image modifications to each individual image. Since we plan to use the neural network to classify images from the front camera of the phone, which tends to be of poorer quality, it is logical to add some level of noise or blur to the dataset, change the contrast or saturation level, or slightly rotate the image. Other adjustments such as mirroring, skewing or cropping can be applied to avoid overfitting the neural network.
The augmentation of the AffectNet dataset for the purposes of this paper was handled by the Python library imgaug [77]. The operators were set so that, after each epoch, each of the available transformations is randomly applied to each frame of the dataset with a probability of 10 %. In the case of blur, Gaussian blur and median blur were applied with a probability of 2 %. The transformations are applied in random order independently of one another, but always at most once per image.
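A minimal sketch of such a pipeline with imgaug; the operator parameters are illustrative rather than the exact training values:

```python
import imgaug.augmenters as iaa

# Each operator fires independently with 10 % probability per image;
# the blur operators use the lower 2 % probability described above.
augmenter = iaa.Sequential([
    iaa.Sometimes(0.02, iaa.GaussianBlur(sigma=(0.5, 1.5))),
    iaa.Sometimes(0.02, iaa.MedianBlur(k=3)),
    iaa.Sometimes(0.10, iaa.AdditiveGaussianNoise(scale=(0, 10))),
    iaa.Sometimes(0.10, iaa.LinearContrast((0.75, 1.25))),
    iaa.Sometimes(0.10, iaa.Affine(rotate=(-10, 10))),
    iaa.Sometimes(0.10, iaa.Fliplr(1.0)),
], random_order=True)

# images: uint8 array of shape (batch, height, width, channels)
# augmented = augmenter(images=images)
```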
All the above-mentioned models were trained for 25 epochs; after each epoch the success rate of the network was measured, and the models that achieved the highest success rate over the whole training are shown in this comparison. It is certainly worth noting the variation in success rates with different batch sizes: depending on the architecture, a batch of 8 or 16 seems to be best, with success rates decreasing as the size is increased further. For the MobileNetV2 architecture, a batch of 16 was also tested, but its success rates are almost identical to those with a batch of 32.
In order to maximize the potential of each architecture, its settings must be adapted to the problem at hand. Many networks have variable widths and depths, and the number of trainable parameters, as well as the memory and computational requirements, are directly related to this. Since this work focuses on use on mobile devices, a limit of about 10 million parameters proved decisive during testing; on an average phone, even with good optimization, a larger network cannot operate in real time (a latency of up to 150 ms can be considered acceptable). For a better idea, this corresponds to a model size of about 80 MB, which is about 15 MB after conversion to TensorFlow Lite and optimization.
Conversion of the model from TensorFlow to the TensorFlow Lite standard is absolutely necessary to incorporate the neural network into the testing application running on Android. For one thing, there is currently no TensorFlow API for Java [49], but TensorFlow models in the standard format are also not at all optimized to run on a mobile phone. Converting to TensorFlow Lite is lossy due to its optimization techniques (the converted models will not have the same success rate as the original ones), but with reasonable settings the difference is quite negligible, while the size of the optimized model is a fraction of the original. The average size saving of the models used in this text is around 77 %.
Since the original models are already of acceptable size for mobile use, there is no need to focus the optimization on size reduction; and since the architectures used are designed to be compact and efficient, there is no need to aggressively target latency either. For this reason, we chose the default optimization setting for conversion, which represents a trade-off (see the sketch below).
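A minimal conversion sketch with the default optimization setting; the model paths are an assumption:

```python
import tensorflow as tf

model = tf.keras.models.load_model("emotion_model.h5")  # assumed path

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# The default optimization applies post-training quantization and
# represents the size/latency/accuracy trade-off discussed above.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("emotion_model.tflite", "wb") as f:
    f.write(tflite_model)
```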
Most architectures have a fixed default input image size (most often 224 × 224 × 3), but some, for example MobileNetV3, can work with images of arbitrary size if no size is specified in the training settings. However, such a model cannot be directly converted to TensorFlow Lite, since the generic dimension is replaced by a dimension equal to 1 with no possibility of overriding the value. The model can still be converted indirectly: using TensorFlow, we create an identical model (this time with the correct input image dimension), load the weights from the existing model, and compile and save the new one. We have verified that a model created in this way has exactly the same success rate as the original, and conversion to TensorFlow Lite is then possible (see the sketch below).
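A sketch of this workaround, assuming a MobileNetV3Small trained for the eight AffectNet categories; the paths and class count are illustrative:

```python
import tensorflow as tf

# Model trained with an unspecified (generic) input size; path assumed.
trained = tf.keras.models.load_model("mobilenetv3_generic.h5")

# Identical architecture, this time with a fixed input dimension.
fixed = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), weights=None, classes=8)
fixed.set_weights(trained.get_weights())  # layer shapes must match exactly

# With a fixed input shape, conversion to TensorFlow Lite succeeds.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(fixed).convert()
```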
The application and its source code, with all converted and trained models, are freely available; the application is published on Google Play and can be downloaded to evaluate the methods discussed here. Figure 1 shows samples from the running application, demonstrating emotion classification with the user's chosen model. Links to the application and the repository are provided in the SUMMARY section of this paper.

V. EVALUATION RESULTS
In total, 63 classification and 28 regression models were trained, based on twelve different neural network architectures. The models differed in their internal parameters, where the implementation allowed it, and in the training setup. All models were evaluated during the initial training after each epoch, and the most successful one was then converted to the TensorFlow Lite format and retested. For demonstration purposes, our mobile application includes all the models presented here and gives the reader the opportunity to try the algorithms on their own mobile phone.
Tables 1 and 2 show the ten best-performing regression models and the twenty best-performing classification models. The tables contain the parameter settings used for the network itself as well as the training settings. The last two columns give the RMSE or the percentage success rate achieved by the model before and after optimization and conversion to TensorFlow Lite.
To shorten the descriptions of the network parameters, the following abbreviations are used in the tables: A = Alpha; D = Depth; CH = Channels; SF = Scale Factor; B = Bottleneck; C = Compression; MINI = Minimalistic; LR = Learning Rate; TF = TensorFlow; TFLite = TensorFlow Lite.
By far the most successful is undoubtedly MnasNet, which appears several times with different settings, closely followed by different versions of ShuffleNet. The differences between the achieved RMSE values are minimal, and therefore all models in the table can be considered essentially equivalent in their ability to determine emotions. Another interesting fact that can be gleaned from Table 1 is that no model experienced a change in average RMSE on the test dataset due to optimization and conversion to TensorFlow Lite (the values differ negligibly). Figure 4 shows the RMSE values with respect to model size; as we can see, there is no significant evidence that model size is a key success factor.
Even as a classifier, MnasNet was among the best, with a slightly bigger lead over the others in Table 2. ShuffleNet, on the other hand, failed to train to such a success rate, and EfficientNet is instead abundantly represented. As in the previous table, a batch size of 8 appears to be optimal for most networks. We also see the Adam optimizer represented with slightly greater frequency, but it cannot be concluded that it is superior. In the case of the classifiers, we can already notice slight differences in success rates after conversion to TensorFlow Lite; for some models the success rate paradoxically even increased slightly. In Figure 3 we can see that the success rate of the network essentially increases with the size of the model.
The reference RMSE of the AlexNet network trained by the authors of the AffectNet dataset was 0.37 for valence and 0.41 for arousal. For the best MnasNet model produced in the context of this paper, these values are 0.34 and 0.42, making it a slightly better model on average, with less computational effort and a significantly smaller size than AlexNet. The benchmark success rate of AlexNet as a classifier was 58 % in the original AffectNet paper; this paper achieved a highest success rate of about 57 %, also with an MnasNet model. In both categories, the results obtained are comparable to the benchmarks, but rather than to the excellence of the solution itself, we attribute this mainly to the limitations of the AffectNet dataset annotation, which is very inconsistent, especially for the valence and arousal values.
At the time of writing, the highest classification success rate achieved on the AffectNet dataset was 63.03 %, with the EfficientNetB2 model [78], which is definitely not suitable for use on a mobile device. The best RMSE values on the AffectNet dataset were achieved with the VGG-Face network [79]; specifically, the authors report valence and arousal values of 0.356 and 0.327, respectively [80]. Unfortunately, they do not state in their paper whether these values were achieved within the output of a single model or whether multiple independent models were trained, which seems more likely. Even so, the valence value achieved by the MnasNet model in this paper is better.
Tables 3 and 4 list the sizes and measured average latencies of the TensorFlow Lite models on the tested devices (several Android phones and a Raspberry Pi). Only models that differ in network architecture settings are listed, as training parameters (batch size, number of epochs, optimizer, etc.) do not affect latency or model size. Table 3 lists the values for the regression models and Table 4 for the classification models. We chose 150 milliseconds as the reference latency on the least powerful phone, as it is not desirable to overload the phone's system resources unnecessarily when the demonstration application is running. Differences in model sizes relative to the sizes of today's average applications do not play an important role. All latencies are measured without using TensorFlow Lite delegates, as these are not compatible with some architectures; at the same time, delegates are not supported on all device types, so the most general configuration available on all devices is measured. For faster orientation in the measured data, the values of both tables are represented by scatterplots in Figures 5 and 6.
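For readers who want to reproduce comparable measurements off-device, the following is a minimal latency-measurement sketch using the Python TensorFlow Lite interpreter; the measurements in Tables 3 and 4 themselves were taken inside the Android application, and the model path here is an assumption:

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Dummy frame matching the model's input shape and (assumed) float input.
frame = np.random.rand(*inp["shape"]).astype(np.float32)

latencies = []
for _ in range(100):
    interpreter.set_tensor(inp["index"], frame)
    start = time.perf_counter()
    interpreter.invoke()
    latencies.append((time.perf_counter() - start) * 1000.0)

print(f"mean latency: {np.mean(latencies):.1f} ms")
```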
DenseNet121 is the worst in terms of latency and can be excluded from any further comparisons. The same is true for the largest configurations of MnasNet (Alpha = 1.5), ShuffleNet (Channels = 200) and ShuffleNetV2 (Scale Factor = 1.5), which still have too high latency on less powerful devices. The other models can be considered usable, but of course the lower the latency, the better. From Table 1 we know that very good RMSE results were achieved by MnasNet A = 1.0 D = 1, EfficientNetB0 and all the MobileNets, so with respect to acceptable latency values we can consider them the best overall.
Even among the classification models, DenseNet121 is again the worst due to its huge latency and is thus almost unusable for mobile devices. The highest success rate among the classifiers is achieved by MnasNet A = 1.5 D = 3, but it is too demanding; the other MnasNet configurations are nevertheless usable. Unfortunately, the EfficientNet models are a big disappointment even in the smallest configurations; for real-time image analysis on a phone, latencies between 200 and 300 milliseconds are still very high. GhostNet achieved a classification success rate of up to about 54.7 %, which, combined with a latency of about 100 milliseconds on a low-performance device, is a surprisingly good result. Although the MobileNet architecture models achieve rather average results in emotion classification (MobileNetV2 54.8 %, MobileNetV3Small 53.9 % and MobileNetV3Large 54.5 %), their latency is very low even in the largest configurations. NASNetMobile achieves uncompetitive results in both success rate and latency, and there is no point in pursuing it further. The ShuffleNet models in both versions achieve relatively low success rates (around 50 %), and while enlarging the network does increase the success rate, it has too high an impact on latency; therefore no ShuffleNet model is applicable for real deployment on a mobile device. A pleasant surprise is SqueezeNet, which achieves a success rate of about 54.4 %, has very low phone hardware requirements in all respects, and is also the model with by far the shortest training time.
To summarize the above results, the best models with respect to latency and the RMSE values achieved are MobileNetV3Large MINI, MobileNetV2, EfficientNetB0 and MnasNet A = 1.0 D = 1. A problem arises more generally in the use of regression models to determine valence and arousal values when training on the AffectNet dataset, because the values from the test dataset do not quite match reality.
For the test application, it turns out that it is really not easy to reach a negative value of arousal, and thus to get into the lower half of the Russell diagram. The calmest detectable emotion is therefore only neutral, even though the test dataset contains fairly uniformly distributed values across the entire diagram. Although detections in the upper half of the diagram are very accurate, we preferred to use a classifier for real deployment.
Among the classifiers, GhostNet, MnasNet A = 1.0 D = 1, SqueezeNet C = 1.0 and MobileNetV2 achieved the highest success rates. The SqueezeNet model excels only in size; if our primary limit were model size, it would have no competition, but we consider latency and success rate more important. In terms of success rate, the three remaining models are almost identical, ranking right behind one another in Table 2. However, the difference in latency of over 10 ms on the less powerful devices is in MobileNet's favour, while if we consider only latency on powerful devices (like the Pixel 6), the difference from MnasNet is negligible. So, in this case too, much depends on the specific target application. All three models mentioned above are supported by the TensorFlow Lite GPU delegate, and with such a setup their resulting latency is reduced by up to 65 % on low-performance phones and by up to half on the Pixel 6.

VI. SUMMARY
In recent years, we have witnessed great development in neural networks, and the field of emotion recognition has not been left out. This paper offers a summary of approaches to facial emotion classification that are widely used in practice, especially in the mobile application segment. We have attempted to demonstrate their capabilities on commonly encountered mobile devices and to provide a benchmark. The aim of this paper is not to compare the different models in full generality, but on one specific, widely used case: facial emotion classification.
In contrast to individual articles, which mostly focus on one particular approach, our aim was to compare each method from a practical point of view, which is of most interest to mobile application developers, among others. We focused on success rate, latency and model size, the key parameters for practical deployment. The best models with respect to latency and the RMSE values achieved in our evaluation were MobileNetV3Large MINI, MobileNetV2, EfficientNetB0 and MnasNet. However, a clear determination of the best model depends on the specific needs of the application; for the expected use, it would probably be MnasNet, since we do not consider the differences in latency to be that significant and MnasNet achieved the lowest RMSE values.
We implemented and trained all models on popular datasets and adapted them to run on Android mobile phones. Based on feedback from users of our mobile app, we can say that the MobileNetV3 and MnasNet models indeed performed subjectively the best in most configurations, which is consistent with their popularity among developers. We have also highlighted open problems in this area that may inspire new approaches to emotion recognition on mobile phones in the future.
The demonstration application developed for the purposes of this article can be freely downloaded from Google Play: https://play.google.com/store/apps/details?id=cz.vsb.faceemotionrecognition. The source code is available on GitHub for readers' convenience, see https://github.com/VojtaMaiwald/FaceEmotionRecognitionTest. Please note that the application built from GitHub includes all models, while the application hosted on Google Play lacks some of the larger models due to size restrictions under Google Play policy. No data is transferred to the Internet: all models are stored and run locally on the device, so there is no risk of leaking personal or otherwise sensitive data.
The source code and all trained models are also publicly available at https://github.com/VojtaMaiwald/Diploma, together with all latency, success rate and RMSE measurements for both the standard TensorFlow models and the models converted to TensorFlow Lite. The repository also contains source code for working with the AffectNet dataset, image augmentation, model training and testing, and implementations of the neural network architectures not available in the TensorFlow Keras API.

FIGURE 1. Demonstration application for Android. The left image shows the results of the classification model (MnasNet in this particular case), while the right image shows the outputs from the regression model. The results are displayed at the top of the captured video.

FIGURE 2. Sample of inappropriate images from the FER dataset. It contains a large number of drawings, badly cropped faces, watermark overlays, or images where a face is not even present. It can also be seen that the original image resolution of only 48 × 48 is really low, and many facial details are lost because of this.

FIGURE 3. Classification model success rates with respect to model size. Intuitively, we can see that the larger the model, the better the classification results.

FIGURE 4. Regression model RMSE with respect to model size. The lower the RMSE, the better the model. The contribution of model size to the quality of the results is not obvious here. Despite its size, DenseNet provides the worst results of the entire collection.

FIGURE 5. Performance of classifier models. Scatter plots showing the relationship between mobile device latency and success rate. Models are shown in different colors, with point size reflecting the size of the model. As we can see, MobileNet networks provide very good performance on all devices while maintaining a small footprint. However, if we consider only latency on powerful devices (like the Pixel 6), the difference with MnasNet is negligible.

FIGURE 6. Performance of regression models. Graphs showing the characteristics of the models with respect to latency and RMSE values (the smaller the RMSE value, the better the model). The best models are from the MobileNet, EfficientNet and MnasNet groups. We can regard MnasNet as the best, because we do not consider the differences in latency to be that significant and MnasNet achieved the lowest RMSE values.

TABLE 1. Top ten regression models.

TABLE 2. Top twenty classification models.

TABLE 3. Latency of TensorFlow Lite regression models on Android phones.

TABLE 4. Latency of TensorFlow Lite classification models on Android phones.