Role of Zoning in Facial Expression Using Deep Learning

Facial expression is an unspoken message essential to collaboration and effective discourse. Facial expressions convey a person's inner emotional state and are very effective for communicating actual emotions. Anger, happiness, sadness, contempt, surprise, fear, disgust, and neutral are eight common human expressions. The scientific community has proposed several facial emotion recognition techniques. However, because deep learning models are given few face landmarks and little of their intensity, the performance of facial expression recognition still needs to improve before facial emotions can be predicted accurately. This study proposes zoning-based face expression recognition (ZFER), which locates more face landmarks to perceive deep facial emotions through zoning. After face extraction, landmarks such as the eyes, eyebrows, nose, forehead, and mouth are extracted from the face. In the second step, each landmark is zoned into four regions, and the zone-based face landmarks are passed to the VGG-16 model to generate a feature map. Finally, the feature map is given as input to a fully connected neural network (FCNN) that classifies facial emotions into multiple classes. Various experiments are performed on the facial expression recognition (FER) 2013 and CK+ datasets to compare the proposed model with state-of-the-art facial expression recognition approaches using performance metrics such as accuracy. The accuracy of the proposed method with face features on CK+ and FER2013 is 98.4% and 65%, respectively. With zoning, accuracy on the CK+ dataset improves from 98.47% to 98.74%.


I. INTRODUCTION
Many advances have been made in the field of human-computer interaction, such as vision-based security systems, self-driving cars, and smart applications based on artificial intelligence and predictive analysis [1], [2]. For communication, human beings usually use gesture, speech, and facial expression [3]. Language is a verbal communication medium, and the quality of voice plays an essential role in sensing emotion through it [4]. Facial expressions and gestures, in contrast, are non-verbal media of communication. According to one study, 7% of communication between humans is conveyed through linguistic language (verbal parts), 38% through paralanguage (vocal parts), and 55% through facial expressions [5]. Therefore, facial expressions are vital in perceiving information in face-to-face communication. Facial expressions define the state of emotion, which is very effective in interpreting human reactions and feelings [6]. Facial expression is an unspoken message and an essential part of collaboration and of conveying appropriate information. Charles Darwin developed the first conceptual framework for emotional expression. Similarly, a study by Ekman and Friesen enumerated basic expressions such as sadness, fear, happiness, disgust, anger, and surprise [7]. Fig. 1 illustrates the eight basic facial expressions. Humans perform face recognition in their daily lives. People can recognize faces they have met before and easily differentiate them from strangers. This task is the basis for human cooperation, relationships, and collaboration.
The associate editor coordinating the review of this manuscript and approving it for publication was Hossein Rahmani.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the last few decades, advances in computing power have made available more powerful systems that can process massive amounts of data. Image Processing and Computer Vision [8] is the field of Artificial Intelligence in which researchers analyze image and video data to extract patterns and valuable information. Computer Vision has many application areas, such as human-computer interaction, biometric authentication, medical imaging, multimedia management, and surveillance [9]. Feedback related to user emotions provides valuable information that could positively impact areas such as e-marketing, robotics, and smart products. Facial expression recognition is therefore an important image processing application. The accuracy of expression recognition depends on facial features, so facial features play a crucial role in how accurately an emotion recognition system predicts. A sophisticated technique is required that extracts the most relevant facial features for accurate emotion recognition.
Face recognition techniques suffer from a visual pattern recognition problem when a three-dimensional object has to be identified from its two-dimensional image [10]. Nevertheless, face recognition and facial emotion recognition tasks can be performed automatically using Image Processing and Computer Vision techniques. This area has made significant progress toward better face recognition models, and existing face recognition systems achieve satisfactory results under certain constraints. However, face images suffer from several problems, such as head pose, blur, and illumination [11], that can affect the performance of a facial recognition system. Furthermore, recognizing facial expressions automatically and accurately is a difficult job. There are two main difficulties with emotion recognition systems. First, expressions are distinct from person to person, which makes it difficult to compare different persons. Second, some expressions are hard to distinguish, which makes it more challenging to differentiate between the emotions of the same person.
Because deep learning models are given few facial features and little of their intensity, improving the performance of facial expression detection is still challenging. Therefore, for accurate prediction of facial emotions, a model called ZFER is proposed that locates more face landmarks to perceive deep facial emotions through zoning. First, after face extraction, face landmarks such as the eyes, eyebrows, nose, forehead, and mouth are extracted. In the second step, zoning is applied to obtain four regions of each landmark [12]. Next, the zone-based face landmarks are provided as inputs to the VGG-16 model to generate a feature map. Finally, the FCNN is employed to classify facial emotions based on the generated feature map. To highlight the significance of the proposed research study, two benchmark datasets, FER 2013 and CK+, are used to compare the proposed model with state-of-the-art facial expression recognition approaches.
The main contributions of this study are as follows:
• Face landmark localization and feature extraction based on zoning.
• Extract the forehead as an extra feature for emotion recognition.
• Analyze the impact of zoning on facial features for the classification of emotions.
• Develop a robust hybrid VGG-16-DNN assisted facial emotions classification model to perform extensive experiments using benchmark FER 2013 and CK+ datasets.
• Finally, the impact of zoning is analyzed to highlight the effectiveness of the proposed VGG-16-DNN model.
The rest of the paper is organized as follows. Section II presents a comprehensive literature overview of the current research issues and further highlights the motivations for our contributions. Section III presents the proposed methodology. The results and discussion are presented in Section IV. The conclusion is presented in Section V.

II. RELATED WORK
The scientific community has proposed various methods for emotion recognition. This section gives an overview of existing approaches to emotion classification. First, we present facial expression recognition based on the well-known FER datasets. We then present facial expression recognition mechanisms based on deep learning paradigms, ensemble learning models, and transfer learning. Lastly, we give an overview of the issues, challenges, and research gap to show the significance of this study. Identification and recognition systems based on motion-sensing data collected through accelerometers and gyroscopes have been proposed using methods such as deep CNNs, SVMs, and few-shot settings. Rich face representation learning using deep learning helps predict social relation traits from facial images. The network architecture introduced for this purpose includes a bridging layer that controls the existing relationship between datasets. Moreover, it enables learning from different attribute sources and coping with missing target attribute labels. One extension of that study is to model relationships among more than two people via a graphical or voting model, in which each face is denoted by a node and the relationship between two faces by an edge [13]. A deep convolutional network combined with an alignment mapping network overcomes the performance decline caused by non-aligned facial features. The method uses both aligned and non-aligned features to perform FER on the FER 2013 dataset, resulting in improved performance [14]. A VGG-like standard deep architecture is presented for a more precise categorization of children's learning-related emotions using theoretical and psychological frameworks. It produces effective state prediction accuracy, including for neutral and positive states, on the FER+ dataset.
For a meaningful interpretation of information, that work can be extended to multiple sources and multi-modal datasets [15]. Alenazy and Alqahtani proposed an approach based on a semi-supervised deep belief network (DBN) and the gravitational search algorithm (GSA) [16]. This approach predicts seven classes of facial expressions, each of which is described by an associated 4-bit code. The DBN is trained and evaluated on the JAFFE, Oulu-CASIA, CK+, and MMI datasets.
Deep Sparse AutoEncoders (DSAE) are used for feasible and effective FER on eight facial expressions, using high-dimensional features that are both geometric and appearance-based. The research not only learns robust and discriminative features but also compares the proposed method with three state-of-the-art methods. Furthermore, it can be used in more complex systems catering to various performance requirements [17]. A shallow CNN (SHCNN) with three layers is introduced to classify static and micro-expression data while overcoming overfitting on small datasets (FER 2013, FER+, CASME, CASME II, SAMM) and ignoring redundancy. It gives the best results on the FER 2013 and SAMM datasets and comparable results on the remaining datasets [18]. To cope with the challenge of intra-class variation, a mini-Xception architecture, based on Xception and CNN, is applied to the FER 2013 dataset, providing a real-time vision system that performs face detection and emotion classification simultaneously. The method performs well even with small data and limited features. Adjusting the number of convolutional layers and the filter size can further improve performance. The mini-Xception algorithm is applied at the original image dimensions [19].
In [20], the author proposed a technique for recognizing facial expressions that predicts the expression type in the face region of an input image. Photographs acquired in an unrestricted setting were considered. A facial region is detected in each image, and the colour facial image is then converted into a grey-scale image to speed up the operations. The goal is to distinguish the many types of expression in these photographs. This few-shot learning mechanism, used to build a realistic framework, improves performance compared to RNNs and handcrafted features. The novel distribution matching approach (FaceBehaviorNet) is a practical framework for heterogeneous multi-task learning. This framework is not limited to large-scale face analysis; it also leverages the concept of zero- and few-shot learning in its two case studies. The framework is evaluated using the FER datasets. The analysis results show that the framework is suitable for continuous affect estimation, facial action unit detection, basic emotion recognition, facial attribute detection, and face identification. Nevertheless, the framework can be extended to improve its generalization abilities [21]. Another model, IFER-DTFL, is presented in [22], where the author classifies facial expressions in three stages. ResNet50 and Mask RCNN are used for face detection, the Adam optimizer for hyperparameter optimization, DenseNet121 for feature extraction, and the WKELM approach for classification. Furthermore, [23] proposed a densely connected CNN combined with hierarchical spatial attention to classify facial expressions. Combining these two modules captures the relevant emotional features and yields better classification results. The experimental results show the superiority of the proposed model.
Inconsistent Pseudo Annotations to Latent Truth (IPA2LT), with an end-to-end LTNet, is proposed to overcome the errors and biases caused by inconsistent facial expression annotation. Performance is improved on seven FER datasets (CK+, MMI, Oulu-CASIA, SFEW, CFEE, RAF, AffectNet), and results are enhanced for noisy, inconsistent, synthetic, and actual data [24]. Another aim is to cut framework development cost and create a framework for general affect recognition. One such framework is FaceChannel, a fast and furious deep neural network for FER that uses a lightweight network with few parameters to learn facial features and adapt them to different datasets (AffectNet, FER+, FABD, OMG-E). The system's strengths are improved performance with fewer trainable parameters and cross-dataset analysis under other affective recognition conditions. Studying and deploying this model on platforms with reduced data processing capabilities, such as social robots, can be a future contribution [17]. Similarly, a deep CNN for noisy labels contributes an enhanced FER+ dataset, in which ten taggers label each input, and can be trained via four different approaches: majority voting, multi-label, probabilistic, and cross-entropy loss [25]. The difficulty of annotating large-scale datasets can be addressed with Self-Cure networks (a self-attention mechanism and careful relabelling). The synthetic and real-world uncertainties of the Synthetic FER, WebEmotion, RAF-DB, AffectNet, and FERPlus datasets are suppressed using this approach [26]. Because these studies are evaluated using the FER dataset, there is a need to enhance FER system performance. Therefore, the Facial Image Threshing (FIT) machine was developed following these principles: removing irrelevant facial images, facial image collection, misplaced face data correction, original dataset merging, and data augmentation.
The FIT machine recognizes emotions, enhances performance, and can be used by researchers and developers to create an independent FER dataset [27]. A cognitive model for facial expression recognition performs emotion classification using a CNN on the FER+ dataset and is tested in live validity mode. It can also be applied in the health sector. It provides faster results with multiple cameras and efficient results with clear visuals [28]. A FER system for real-time video streams is built using CycleGAN in four steps: preprocessing, CycleGAN-based design, generation of new synthetic image data, and a DNN-based inference system. Transfer learning is modified with a balanced version that includes face detection and FER. The method is simple and provides a high recognition rate and speed for facial expressions on the original FER 2013 dataset. It can also be used on video sequences and online learning platforms [29]. A hybrid deep learning model that classifies human emotions using a DNN and transfer learning is proposed in [30]. That study combines multiple deep learning models into a hybrid classification approach for better prediction results. According to the author, the hybrid approach achieved 81.42% accuracy on the testing data and 95.93% accuracy on the training data. A facial recognition system using a deep learning-based ensemble model can be used as a passive factor in two-factor authentication [31]. Similarly, an ensemble method is presented for real-time facial expression recognition that extracts facial regions in real time and compresses data using multiple features. It addresses insufficient data and expression-unrelated intra-class differences using a Local Pathway network and an Attention Module. The datasets used for analysis are FER+ and CK+ [32]. Branching is used instead of a single deep convolutional network to reduce the redundancy and cost of the ensemble model training process.
Redundancy is thus reduced by varying the branching levels while maintaining diversity and generalization power. Moreover, the generalization error is reduced as well. These experiments are performed on the AffectNet and FER+ datasets using an ensemble with shared resources. The study can be extended with approaches that overcome catastrophic forgetting in ensembles using shared resources [33]. Using datasets such as JAFFE, CK+, PIE, and real-world photographs, another study provides an effective facial expression recognition strategy to discriminate diverse human expressions in retrieved images. This method combines traditional preprocessing with deep convolutional network-based feature extraction to improve the system's overall accuracy and retrieval time. An improved cat swarm optimization (ICSO) algorithm is used for feature selection, and image retrieval uses ensemble classifiers such as a support vector machine (SVM) and a neural network (NN). Recognition rate, precision, recall, F-measure, and sensitivity metrics are used to evaluate the performance of the model [34].
An MFP aggregation network based on convolutional neural sub-networks learns deep characteristics from facial patches, which are then gathered into a single model architecture for classifying emotion [35]. The framework uses Conditional Generative Adversarial Networks and a set of transformation functions for data augmentation. When evaluated on small datasets, the proposed MFP-CNN performs well, but with just three convolutional layers per sub-network it cannot outperform contemporary DNNs trained on large datasets. On the other hand, the MFP-CNN does not require a lot of processing power or large training datasets, and it trains quickly. Therefore, the design could aid in deciphering the traits of face areas for soft biometric features, emotion recognition, and other attributes such as age, gender, ethnicity, and identification. The Deep Attentive Centre Loss (DACL) method adaptively selects a subset of significant feature elements for enhanced classification [35]. Using the intermediate spatial feature maps in the CNN as context, DACL incorporates an attention method to estimate attention weights connected with feature importance. The estimated weights accommodate the sparse formulation of centre loss to achieve intra-class compactness and inter-class separation for the relevant information in the embedding space. The in-the-wild FER datasets RAF-DB and AffectNet are used in that research. The approach uses a sparse re-formulation of centre loss to manage the contribution of the deep feature representations to the objective function of deep metric learning. By providing attention weights to the sparse centre loss, an attention mechanism fully parameterized by a customizable neural network predicts the probability of contribution along all dimensions.
In summary, FER is one of the most successful methods for providing knowledge about emotional expressions, and it is often limited to six fundamental emotions plus neutral [36]. Several deep learning architectures have recently been proposed and developed. They offer various services to detect human emotions using databases of random pictures collected from the real world and from laboratories. To verify the viability of a neural network architecture, researchers trained and evaluated it on various datasets. The recognition rates differ from one database to the next even though the same basic principles are followed. The classification techniques employed, the available memory, and model performance are influenced by the expanding volume of data, the processing time, and the computing resources required for research activities. Some data augmentation techniques were applied to the training images to improve the learning parameters of the proposed deep learning models. Together with fine-tuning of the hyper-parameters, the proposed data augmentation strategies (bilateral filtering, unsharp filter, sharpening filter, image rotation, image scaling, shear mapping, image zooming, image filling, horizontal image flip) not only tackle the problem of overfitting but also improve the performance of the proposed system. The trade-off between data augmentation and deep learning-based features allows the FER system to recognize expressions in more difficult test samples. Moreover, vanishing gradient problems are prevented by fusing landmarks (geometric and temporal information) with conventional video-based methods. The methodology is effective on the given datasets (CK+, MMI, AFEW), resulting in the fusion of image information and landmarks. The research can be extended to more landmarks and to the use of various modalities for multi-modal FER systems [37].

III. PROPOSED METHODOLOGY
This section discusses the steps of the proposed methodology for emotion classification. In the first step, images from the FER2013 and CK+ datasets are loaded using the OpenCV (cv2) library. After preprocessing of the FER2013 and CK+ datasets, different experiments are performed. Then, emotions are classified using a convolutional neural network. Finally, the proposed method is evaluated using accuracy metrics. Fig. 2 shows the proposed methodology diagram for classifying emotions.
First, images of different persons with multiple emotions are given as input to the model. Next, face features such as the mouth, nose, and left and right eyes are extracted from the face images. Zoning is then applied to the extracted features to obtain more relevant face features. Emotions are classified on the extracted face features. Finally, evaluation metrics are used to assess the performance of the proposed approach.
Furthermore, Fig. 3 shows a step-by-step flow of the proposed facial expression classification model, which aims to classify facial expressions into eight classes. In the first step, input images are loaded by calling the OpenCV (cv2) image-loading function with the path of the given facial expression dataset. In the second step, the loaded images are passed to the preprocessing module, which resizes all images to a fixed size, because the CNN model only processes images with uniform dimensions. All input images are therefore resized to 64 × 64 pixels. Next, facial features such as the mouth, nose, and eyes are extracted from the input images. In the following step, the zoning method is applied to the extracted facial features to obtain more relevant and promising features. The prepared feature set is then divided into three sets: training, validation, and testing. A robust model is trained on the training samples and validated on the validation set to evaluate the performance of the proposed emotion classification model. Finally, the testing samples are used as unseen data to evaluate the effectiveness and generalization of the proposed model.
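The three-way division of the prepared feature set can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, the fixed seed, and the use of a plain shuffle (rather than, say, a class-stratified split) are assumptions; the default 70/15/15 ratios are the ones the paper reports for its experiments.

```python
import random


def split_dataset(samples, train=0.70, val=0.15, seed=0):
    """Shuffle samples and split them into train/validation/test lists.

    Hypothetical helper: the paper does not say whether its split is
    stratified by class, so a plain random shuffle is assumed here.
    """
    if not (0 < train < 1 and 0 <= val < 1 and train + val < 1):
        raise ValueError("train and val must leave a non-empty test share")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder becomes the test set
    )


# Toy usage: integers stand in for (image, label) pairs.
train_set, val_set, test_set = split_dataset(range(100))
```

Keeping the test share as "whatever remains" guarantees that every sample lands in exactly one split even when the ratios do not divide the dataset size evenly.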

A. DATASETS
For the proposed model, two well-known publicly available image datasets are used: FER2013 and CK+. The first dataset, FER2013, was introduced in a paper by Goodfellow et al. [38], written after a workshop challenge in the facial expression recognition domain. It consists of approximately 30,000 images categorized into seven classes: angry, disgust, fear, happy, sad, surprise, and neutral. However, FER2013 contains images that are misplaced and grouped in the wrong class. It also contains irrelevant facial expression images, which cause computational overhead.
The second benchmark dataset, CK+, is acquired from Lucey et al. [39] and consists of 920 trainable facial expression images in eight classes: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. Each image in the CK+ dataset is 640 × 490 pixels. CK+ is an extension of an older dataset from the year 2000, namely CK. Furthermore, the CK+ dataset contains only frontal face images and far fewer facial images than the FER2013 dataset. The small number of training samples in CK+ can produce biased and unreliable results, because DL models require a large number of training images to build a reliable and generalized model. Therefore, in this research study, both datasets are substituted and merged to form a reliable dataset for classifying facial expressions into eight classes.

B. DATA PREPROCESSING
The final task for the proposed model is to correctly predict the emotion expressed by each face in the FER2013 and CK+ datasets with high accuracy. To achieve this goal, a technique called zoning is used [40]. The proposed technique utilizes the MTCNN library to calculate the key points needed to locate the four main facial zones in each image as shown in

1) FEATURE EXTRACTION WITHOUT ZONING
Selecting the areas belonging to each face feature is crucial, as each face part contributes to correctly predicting the emotion represented in each image. For this reason, each face part is cropped out of the original image, alongside the entire face, which is kept in its entirety. Each face part is then considered an "extracted feature" that participates in the learning process, so each face feature is saved as a file into its own dataset. Our bottleneck feature stacking is based on the research study of Wang et al. [32], which improves the proposed model's overall performance. As explained earlier, the data is partitioned into six different datasets, one per facial feature. Then, a separate classifier is trained for each facial feature, each eventually extracted from the same face image. The classifier is VGG-16 [41], fine-tuned after hyperparameter optimization. What we need from each classifier are the feature maps from the last pooling layer [42]; these are the bottleneck features. Eventually, six separate feature maps of size K × K × 512 are obtained, which can be directly stacked into a K × K × 3072 tensor. Fig. 7 visualizes the bottleneck feature stacking process.
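The stacking step can be illustrated with a small, dependency-free sketch. It assumes each bottleneck feature map is stored as a K × K grid of channel lists; the helper name `stack_bottleneck_features` is hypothetical, and K = 2 is an illustrative value only (it is what a standard VGG-16, with its five 2 × 2 poolings, would produce for a 64 × 64 input, but the paper does not state K).

```python
def stack_bottleneck_features(feature_maps):
    """Concatenate K x K x C feature maps along the channel axis.

    Each map is a nested list indexed as fmap[row][col][channel].
    """
    k = len(feature_maps[0])
    for fmap in feature_maps:
        if len(fmap) != k or any(len(row) != k for row in fmap):
            raise ValueError("all feature maps must share the same K x K grid")
    return [
        [
            # Join the channel vectors of every map at grid cell (i, j).
            [value for fmap in feature_maps for value in fmap[i][j]]
            for j in range(k)
        ]
        for i in range(k)
    ]


K, CHANNELS = 2, 512  # K is illustrative; 512 channels as stated in the text
# Six dummy maps; map number idx is filled with the constant float(idx).
maps = [
    [[[float(idx)] * CHANNELS for _ in range(K)] for _ in range(K)]
    for idx in range(6)
]
stacked = stack_bottleneck_features(maps)  # shape K x K x 3072
```

With array libraries the same operation is a single concatenation along the last axis (for example `numpy.concatenate(maps, axis=-1)`); the nested-list version above only makes the 6 × 512 = 3072 bookkeeping explicit.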

3) FEATURE EXTRACTION USING ZONING
Selecting the areas belonging to each facial zone is crucial for the proposed model, as each zone contributes to correctly predicting the emotion represented in each image [43]. For this reason, we crop each zone out of the original image, alongside the entire face, which is kept in its entirety. Each zone is then considered an "extracted feature" that participates in the learning process, so each zone (feature) is saved as a file into its own dataset. Finally, after all features have been extracted, zones are made for each image by splitting each feature into two identical zones. This gives 18 features and therefore 18 separate datasets, one for each facial feature and another for the whole face, as shown in Fig. 8. Furthermore, to report experimental results, the data is split as follows: 70% for training, 15% for validation, and 15% for testing.
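A minimal sketch of the zoning step is given below. Several details are assumptions rather than facts from the paper: the cut direction (left/right versus top/bottom) is not specified, the helper name `split_into_zones` and the feature names are hypothetical, and the count of 18 is reproduced under one reading of the text, namely six extracted features, each kept whole and also split into two zones.

```python
def split_into_zones(image, n_zones=2, axis="vertical"):
    """Split a 2-D image (list of pixel rows) into n_zones equal strips."""
    height, width = len(image), len(image[0])
    if axis == "vertical":  # cut along columns: left | right
        if width % n_zones:
            raise ValueError("width must be divisible by n_zones")
        step = width // n_zones
        return [[row[z * step:(z + 1) * step] for row in image]
                for z in range(n_zones)]
    if axis == "horizontal":  # cut along rows: top / bottom
        if height % n_zones:
            raise ValueError("height must be divisible by n_zones")
        step = height // n_zones
        return [image[z * step:(z + 1) * step] for z in range(n_zones)]
    raise ValueError("axis must be 'vertical' or 'horizontal'")


# Hypothetical feature names; the paper's exact list of six is not given here.
FEATURES = ["left_eye", "right_eye", "nose", "mouth", "forehead", "face"]
crop = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4 x 4 crop

inputs = {}
for name in FEATURES:
    inputs[name] = crop  # the unsplit feature
    for z, zone in enumerate(split_into_zones(crop)):
        inputs[f"{name}_zone{z + 1}"] = zone  # its two zones
```

Under this reading, `inputs` holds 6 + 6 × 2 = 18 entries, each of which would be saved as its own dataset and passed to its own fine-tuned VGG-16.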

C. CLASSIFICATION OF FACIAL EXPRESSION
For the second part of our model, a dataset made of stacked feature maps (bottleneck features) is used. Our proposed classifier is then employed to classify facial expressions from these bottleneck features. Fig. 9 shows the flow of the classification part. It consists of a series of fully connected layers [44] ending with a final dense layer of 8 units, one per class, designed to produce a vector containing the probabilities that the input belongs to each class. As a side note, dropout layers are also added in the hidden part of the network [45], and different activation functions, such as Leaky-ReLU [46], were tried.
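The forward pass of such a classifier head can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's network: the layer sizes, the random initial weights, and the Leaky-ReLU slope of 0.01 are all made up; only the final layer of 8 outputs with a probability vector follows the text. Dropout is omitted because it acts as the identity at inference time.

```python
import math
import random


def leaky_relu(values, alpha=0.01):
    """Leaky-ReLU: pass positives through, scale negatives by alpha."""
    return [v if v > 0 else alpha * v for v in values]


def softmax(values):
    """Numerically stable softmax that returns class probabilities."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j][i] links input i to output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(features, layers):
    """Hidden layers use Leaky-ReLU; the last layer feeds a softmax."""
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = leaky_relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))


rng = random.Random(0)
N_INPUT, N_HIDDEN, N_CLASSES = 32, 16, 8  # only N_CLASSES comes from the text
layers = [init_layer(N_INPUT, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_CLASSES, rng)]
probs = classify([rng.random() for _ in range(N_INPUT)], layers)
```

In the actual pipeline the input would be the flattened K × K × 3072 stack of bottleneck features rather than a 32-element toy vector.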

D. EVALUATION METRICS
The performance of our proposed model is evaluated using Accuracy, Precision, Recall, and F1 scores. Based on these metrics, our proposed model is evaluated and compared with the previously proposed models.
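As an illustration of how these four metrics relate, the sketch below derives them from a confusion matrix. The helper name `per_class_metrics`, the row/column convention (rows are true classes, columns are predictions), and the toy two-class matrix are assumptions made for this example; none of the numbers come from the paper.

```python
def per_class_metrics(confusion):
    """Compute accuracy plus per-class precision, recall, and F1.

    confusion[i][j] = number of samples of true class i predicted as j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    report = {"accuracy": correct / total, "classes": []}
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # column minus diagonal
        fn = sum(confusion[c]) - tp                       # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report["classes"].append(
            {"precision": precision, "recall": recall, "f1": f1})
    return report


toy = [[8, 2],
       [1, 9]]  # made-up 2-class confusion matrix
report = per_class_metrics(toy)
```

Averaging the per-class values (macro averaging) would give a single precision, recall, and F1 score per model; whether the paper uses macro or weighted averaging is not stated.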

1) ACCURACY
The accuracy measure describes the fraction of all data points for which our classifier predicts the facial expression correctly. The formula for accuracy is given in Eq. 1:
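Assuming Eq. 1 is the standard definition of accuracy, it can be written as below. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, false negatives) are introduced here for illustration; in the multi-class setting the same quantity is simply the number of correct predictions divided by the total number of predictions.

```latex
\begin{equation}
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation}
```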

A. DATA AUGMENTATION
Since the CK+ dataset is quite small, data augmentation is expected to improve the performance of the proposed model. Data augmentation was applied to the training set only, and the validation set was left intact. The augmentations, produced with the ImageDataGenerator augmentation utility, increased the number of samples in the training split. More specifically, the augmentation mechanisms applied are:
1) Image rotation of ±20 degrees.
2) Re-scaling (required for normalization) by a factor of 1/255.
The initial results are a bit disorienting; thus, the six VGG-16 networks for our feature maps and a second neural network made of Dense and Dropout layers are used. Fig. 10 compares training and validation accuracy and loss on the FER2013 dataset. From a broader perspective, the FER2013 results do not meet our expectations. The results were poor because the data are highly random: most images are unclear, which makes it hard even for human experts to tell the exact expression. The FER2013 results of the proposed model are therefore reported to give the scientific community insights for future research.

2) PROPOSED MODEL RESULTS USING THE CK+ DATASET
Fig. 11 shows the loss and accuracy of the above-mentioned model on the CK+ dataset when the bottleneck features of all features (eyes, nose, left eye, right eye, mouth, forehead, and full cropped face) are combined. To recap, hyperparameter optimization was applied to fine-tune each VGG-16, and the fully connected part was trained from its initial state. After 40 epochs, a loss of 0.08 and a validation accuracy of 98.47% are obtained. The loss converges quickly, and the validation accuracy is consistent with the training accuracy.

3) EXPERIMENTAL RESULTS OF CK+ WITH CROP FACE IMAGES
Another experiment is comparative: it checks whether zoning achieves something better than the baseline of simply feeding face images to a CNN. For a more in-depth understanding, a generic VGG-16 is taken as the baseline model. This model was built specifically for the comparison; its training and validation curves are shown for a setup without any preprocessing, which takes only face images as input. As indicated by the caption of Fig. 12, a fine-tuned VGG-16 is used as our starting point. The results are quite acceptable, reaching 91.8% on the validation set. For better performance, the model with zoning is considered, which achieved 98.4% accuracy on the validation set.

C. EXPERIMENTAL RESULTS USING THE ZONING
1) MODEL USING THE CK+ DATASET WITH ZONING
Fig. 13 shows the loss and accuracy results on the CK+ dataset using the combined bottleneck features of all regions (eyes, nose, left eye, right eye, mouth, forehead, full cropped face, and the mentioned zones) with the above-mentioned model. To recap, each VGG-16 has been fine-tuned, and the fully connected part is trained from scratch.
After 40 epochs, a loss of 0.025 and a validation accuracy of 98.74% are obtained, a slight improvement over the previous results without zoning. The accuracy can also be checked by hand: the confusion matrix shows five of the 170 test samples misclassified, so applying the accuracy formula gives the same test-set accuracy of 0.9705 (about 97.1%).
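The manual check of the test-set accuracy can be reproduced with a short sketch. The confusion matrix below is hypothetical: it is constructed only so that it contains 170 samples with five off-diagonal errors, and it does not reproduce the per-class counts of the actual experiment.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical 3-class confusion matrix (rows: true, columns: predicted),
# chosen only so that it has 170 samples with 5 misclassifications.
toy_matrix = [
    [60, 1, 1],
    [2, 55, 0],
    [0, 1, 50],
]

# 165 correct out of 170 samples.
acc = accuracy_from_confusion(toy_matrix)
```

Any matrix with 165 samples on the diagonal and 170 in total gives 165/170, which matches the reported value up to rounding.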

D. COMPARISON WITH BASELINE
This section compares the proposed method with existing methods. Table 1 shows that the proposed method improved the results from 89.4% to 97.24% on the CK+ dataset. The comparative analysis shows that the proposed method improved accuracy by 7.84 percentage points over the state-of-the-art method presented in [25]. Similarly, our proposed emotion classification method improved accuracy on the CK+ dataset by 4.4% compared with the latest state-of-the-art study presented in [32].
Similarly, Table 2 summarizes a detailed comparison of the proposed model on the CK+ dataset. The proposed model is compared with three different models: AlexNet, ResNet, and CNN. The comparative analysis shows that the proposed model gives significantly better results than the baseline models. It achieved slightly higher accuracy (ACC) than the other three models, and substantially higher performance on the other three metrics: precision (PR), recall (RE), and f1 score. Furthermore, Table 3 compares the proposed method with the existing method on the FER dataset, where the proposed method improved the results from 55.15% to 65%. Based on comparative analysis, it is found that our proposed method improved by 9.75% as compared to the baseline approach presented in [37].
Similarly, Table 4 presents a detailed comparative analysis of the proposed model on the FER dataset, using AlexNet, ResNet, and CNN as competitor models. The analysis is presented in terms of accuracy, precision, recall, and f1 score to highlight the proposed model's significance compared to baseline models. The proposed model achieved 65% accuracy, which is 1.36% higher than AlexNet, 3.18% higher than ResNet, and 1.97% higher than CNN. The proposed model also gives significantly better precision, recall, and f1 score results than the baseline models.
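For reference, the precision, recall, and f1 score used in these comparisons can be derived from a confusion matrix as in the sketch below. It assumes macro-averaging over classes, which the text does not specify; the function name `macro_scores` and the 2-class matrix are hypothetical.

```python
def macro_scores(matrix):
    """Return macro-averaged (precision, recall, f1) for a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical 2-class confusion matrix, for illustration only.
toy_matrix = [
    [8, 2],
    [1, 9],
]
precision, recall, f1 = macro_scores(toy_matrix)
```

Macro-averaging weights every class equally, which matters on datasets where some expressions are much rarer than others; a weighted or micro average would give different numbers.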
Moreover, Fig. 15 visualizes the comparative analysis of the proposed and baseline state-of-the-art models on the CK+ dataset. The proposed model achieved an accuracy of 97.24%, a precision of 96.1%, a recall of 97.5%, and an f1 score of 96.6%, outperforming state-of-the-art DL techniques such as AlexNet, ResNet, and baseline CNN. The proposed model improves accuracy by 2.01%, 3.37%, and 2.54% compared to AlexNet, ResNet, and baseline CNN, respectively. Similarly, it improves precision by 13.1%, 20.1%, and 17.6%, recall by 14.25%, 23.88%, and 19.0%, and f1 score by 14.1%, 22.35%, and 18.6% over the same three models, respectively. Hence, our proposed model is reliable and effective for facial expression classification compared to state-of-the-art techniques.
Similarly, Fig. 16 visualizes the performance of the baseline and proposed models given in Table 4 on the FER2013 data. The proposed model achieved an accuracy of 65.0%, a precision of 64.6%, a recall of 61.5%, and an f1 score of 61.8%, outperforming AlexNet, ResNet, and baseline CNN. In terms of accuracy, the proposed model performs better than AlexNet, ResNet, and baseline CNN by 1.36%, 3.18%, and 1.97%, respectively. Similarly, it improves precision by 2.0%, 3.4%, and 2.0%, recall by 1.4%, 3.0%, and 1.6%, and f1 score by 1.4%, 2.9%, and 1.8% over the same three models, respectively. Based on this performance analysis, we conclude that our model is more reliable and effective than state-of-the-art facial expression classification techniques.
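The CK+ gain over [25] reported in this section (89.4% to 97.24%) is an absolute difference in percentage points. The sketch below contrasts this with a relative gain, using only those two figures; the helper names `point_gain` and `relative_gain` are hypothetical.

```python
def point_gain(baseline, proposed):
    """Absolute improvement in percentage points."""
    return proposed - baseline


def relative_gain(baseline, proposed):
    """Improvement relative to the baseline, expressed in percent."""
    return 100.0 * (proposed - baseline) / baseline


# CK+ accuracies stated in this section: 89.4% for [25], 97.24% proposed.
ck_points = round(point_gain(89.4, 97.24), 2)
ck_relative = round(relative_gain(89.4, 97.24), 2)
```

Here `ck_points` reproduces the 7.84 quoted in the text, while the relative gain over the same baseline is somewhat larger, so the two conventions should not be mixed when comparing methods.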

V. CONCLUSION
Facial expressions play an important role in perceiving information in communication and define the emotional state to effectively interpret human reactions and feelings. In this study, a robust deep learning-based technique is proposed for analyzing and recognizing facial expressions based on the role of zoning. Zoning of facial features is performed to divide and localize the subject face images, and it is used to locate more face landmarks for the identification of deep face emotions. The proposed approach consists of four main steps. First, face landmarks are extracted. Second, zoning is employed to obtain four regions of each localized landmark. Third, the VGG-16 learning model is used to generate the feature map. Finally, a fully connected neural network is used to classify facial emotions. One of the motivations of this study is to build a robust FER model and realize it with open-source libraries and coding frameworks. This study used two benchmark datasets of facial images, CK+ and FER2013, to evaluate the proposed approach. The accuracies of the proposed technique with face features on CK+ and FER2013 are 98.4% and 65%, respectively. With zoning, the experimental results improve from 98.47% to 98.74% on the CK+ dataset. The performance analysis results provide promising directions and motivation for applying facial features in the feature extraction process, and they indicate that applying zoning in the feature extraction process improves the FER model's performance. However, the full datasets should be explored in more depth to address the problem of occlusion in facial images.
In future work, we will realize the proposed model in smart environments based on online face recognition systems in the Internet of Things environment. Furthermore, the performance of the proposed approach can be further improved by using ensemble learning and hybrid optimization mechanisms. Moreover, the architecture of the fully connected network will be improved to handle complex data.

ACKNOWLEDGMENT
Taimur Shahzad and Murad Ali Khan contributed toward conceptualization, methodology, and software. Taimur Shahzad, Murad Ali Khan, and Khalid Iqbal performed the formal analysis and prepared the original draft. Khalid Iqbal, Imran, and Naeem Iqbal contributed toward data curation, visualization, and supervision. Imran and Naeem Iqbal reviewed and edited the original draft, validated and investigated the overall manuscript, and handled funding acquisition. The authors declare that there is no conflict of interest regarding the publication of this work on the role of zoning in multi-class facial expressions using deep learning. (Taimur Shahzad and Murad Ali Khan contributed equally to this work.)
MURAD ALI KHAN received the B.S. degree in computer science from COMSATS University Islamabad, Attock Campus, Punjab, Pakistan, in 2020. He is currently pursuing the integrated Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. He has professional experience in the software development industry and in academia. His research interests include ML and data mining related applications.
IMRAN received the Ph.D. degree from the Computer Engineering Department, Jeju National University, Republic of Korea. He worked as a Researcher with MCL and JNU Big Data Laboratory. He is currently an Assistant Professor with the Department of Biomedical Engineering, IT Convergence College, Gachon University, Republic of Korea. His research interests include software development, IT convergence-based solutions, and entrepreneurship. His research mainly focuses on interdisciplinary scientific applications based on the Internet of Things, machine learning, data science, and blockchain.
NAEEM IQBAL (Member, IEEE) received the B.S. degree in computer science from COMSATS University Islamabad, Attock Campus, Punjab, Pakistan, and the M.S. degree in computer science from COMSATS University Islamabad, Attock Campus, in 2019. He is currently pursuing the Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. He has professional experience in the software development industry and in academia. He has published more than 30 papers in peer-reviewed international journals and conferences. His research interests include AI-based intelligent systems, data science, big data analytics, machine learning, deep learning, analysis of optimization algorithms, the IoT, and blockchain-based secured applications. He is serving as a professional reviewer for various well-reputed journals and conferences.