Facial Sentiment Analysis Using AI Techniques: State-of-the-Art, Taxonomies, and Challenges

With the advancements in machine learning and deep learning algorithms, various critical real-life applications in computer vision have become possible. One of these applications is facial sentiment analysis. Deep learning has made facial expression recognition (FER) one of the most active research fields in computer vision. However, deep learning-based FER models suffer from technological issues such as under-fitting or over-fitting, which are caused by insufficient training or expression data. Motivated by these facts, this paper presents a systematic and comprehensive survey of current state-of-the-art Artificial Intelligence techniques (datasets and algorithms) that address the aforementioned issues. It also briefly presents a taxonomy of existing facial sentiment analysis strategies. The paper then reviews the machine learning and deep learning networks that researchers have designed specifically for facial expression recognition on static images, summarizes their approaches, and presents their merits and demerits. Finally, the paper presents the open issues and research challenges in designing a robust facial expression recognition system.


I. INTRODUCTION
Emotions are efficacious and self-explanatory in day-to-day human interactions. Human emotion is most noticeably conveyed through facial expressions. Facial Expression Recognition (FER) is complex and tedious, but it helps in various application areas such as healthcare [1]-[3], emotionally driven robots, and human-computer interaction. Although advancements in FER have increased its effectiveness, achieving high accuracy is still a challenging task [4]. The six most generic human emotions are anger, happiness, sadness, disgust, fear, and surprise. Moreover, contempt was later added as one of the basic emotions [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Min Xia.
FER is a challenging task, and its accuracy depends heavily on factors such as illumination, occlusion (i.e., obstruction of the face by objects such as a hand or sunglasses), and age. Researchers in the field take these factors into consideration when building their FER models so that considerable accuracy can be achieved. Some important factors for FER are described as follows.
• Illumination factor: The light intensity falling on the object affects the classification of the model. The textural values increase the false acceptance rate, either by minimizing the distance between classes or by increasing the contrast [6].
• Expression Intensity: Expression recognition is highly dependent on the intensity of the expression. An expression is recognized more accurately when it is less subtle, so intensity strongly affects the accuracy of the model.
• Occlusion: If occlusion is present in an image, it becomes difficult for the model to extract features from the occluded part because of inaccurate face alignment and imprecise feature location. It also introduces noise and outliers into the extracted features.
FER systems can be either static or dynamic, depending on the image input. Static FER considers only the face point location information from the feature representation of a single image, whereas dynamic FER considers the temporal information across continuous frames [7], [8]. An overview of the static FER process is shown in FIGURE 1, and its steps are described as follows.
• Dataset: The FER algorithms discussed below need extensive training data to avoid over-fitting. A dataset with well-defined emotion tags for facial expressions is essential for training, validating, and testing the algorithms used to develop FER. These datasets contain sequences of images with distinct emotions, as mentioned above. We have reviewed many datasets used to train different models for real-world use. Table 4 provides an overview of the different datasets available for FER.
• Pre-Processing: This step pre-processes the dataset by removing noise and compressing the data. The steps involved in data pre-processing are: (i) facial detection, the ability to detect the location of the face in any image or frame. It is often considered a special case of object-class detection, which determines whether a face is present in an image or not; (ii) dimension reduction, which reduces the variables to a set of principal variables. If the number of features is large, it becomes harder to visualize the training set and to work on it. PCA and LDA can be used to handle this situation; (iii) normalization, also known as feature scaling. After the dimension reduction step, the reduced features are normalized without distorting the differences in the ranges of the feature values. Normalization methods include Z normalization, min-max normalization, and unit vector normalization, which improve numerical stability and speed up the training of the model.
• Feature Extraction: It is the process of extracting the features that are important for FER. It results in smaller and richer sets of attributes that contain features like face edges, corners, and diagonals, and other important information such as the distance between the lips and the eyes or the distance between the two eyes, which speeds up learning from the training data.
• Emotion Classification: It involves the algorithms that classify the emotions based on the extracted features. Various classification methods exist, which assign the images to different classes. A FER image is classified after it has passed through the pre-processing steps of face detection and feature extraction. Various classification techniques are discussed later in the proposed survey.
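The three normalization schemes named in the pre-processing step can be illustrated with a minimal NumPy sketch. The function names, the small `eps` guard against division by zero, and the toy feature matrix are assumptions made for illustration and are not taken from the surveyed works.

```python
import numpy as np

def z_normalize(X, eps=1e-12):
    """Z normalization: zero mean and unit variance for each feature (column)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)

def min_max_normalize(X, eps=1e-12):
    """Min-max normalization: rescale each feature (column) to the range [0, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo + eps)

def unit_vector_normalize(X, eps=1e-12):
    """Unit vector normalization: scale each sample (row) to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)

# Hypothetical feature matrix: 4 samples (rows) and 3 features (columns)
# whose value ranges differ by orders of magnitude.
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 180.0, 0.1],
              [3.0, 220.0, 0.9],
              [4.0, 240.0, 0.3]])

Z = z_normalize(X)
M = min_max_normalize(X)
U = unit_vector_normalize(X)
```

Z and min-max normalization act per feature, so they preserve the relative ordering of values within a feature, whereas unit vector normalization acts per sample.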
The FER system has various applications such as human-computer interaction, healthcare systems [9]-[14], and social marketing. In the proposed survey, we analyze the existing surveys on the different FER approaches proposed by authors globally. We compare the surveys and develop a taxonomy of the various pre-processing, feature extraction, and emotion classification steps. We also discuss the open issues and future research challenges related to FER.

A. MOTIVATION
Paul Ekman first coined the term FER in the mid-1980s. Since then, researchers have used various machine learning techniques, such as random forest classifiers and artificial neural networks, to recognize the seven basic emotions, and have claimed good and effective results. Automated human emotion detection is important in security and surveillance applications these days. Researchers continue to explore this field to further improve its performance.
Various challenges, like occlusion in datasets and over-fitting of models, have to be taken care of while implementing FER. To the best of the authors' knowledge and as per the literature explored, no available survey exhaustively compares the FER approaches from the perspective of AI. Motivated by this fact, we present a comprehensive survey on FER using Artificial Intelligence (AI) techniques, in which we explore the state-of-the-art machine learning and deep learning (DL) approaches with their merits and demerits.

B. SCOPE OF THE SURVEY
Facial sentiment analysis is one of the most active topics in computer vision. A large body of literature has already been published in this field by researchers across the globe, yet many researchers are still trying to solve the challenges and issues in FER. Various surveys on sentiment analysis have been published in recent years [7], [15], [26], [27]. These surveys have mainly focused on traditional methods like support vector machine (SVM), decision tree classifiers, and artificial neural network (ANN). The DL methods [28], [29] have rarely been explored by the researchers working in the same field. So, in this paper, we analyze the surveys on facial sentiment analysis and present a comparative analysis. For example, Hemalatha and Sumathi [15] surveyed various methods for facial detection, facial feature extraction, and classification in FER, but did not present a proper comparison of the methods considered and the datasets used. Later, the authors in [16] also presented a survey on FER, but they did not mention anything about datasets useful for emotion recognition. Another survey, by Chengeta and Viriri [19], covered various traditional feature extraction techniques like principal component analysis (PCA), Linear Discriminant Analysis (LDA), and Locally Linear Embedding (LLE), after which the authors proposed an ensemble classifier. They did not compare against the advanced DL approaches, which are currently the most novel approaches in FER. Similarly, Baskar and Kumar [18] do not explain the various DL approaches. Recently, DL-based FER approaches have been explored in [7], [27], which are detailed surveys without the discussion on FER. Therefore, in the proposed survey, we systematically survey the various databases used for FER, the various methods for face detection, facial feature extraction, and emotion classification, and the current issues and future challenges in facial sentiment analysis. Our aim is for this survey to benefit those who want to explore this field by giving them a complete overview of the advanced systematic approaches in facial sentiment analysis. Table 2 presents the relative differences between the existing surveys and the proposed survey.

C. RESEARCH CONTRIBUTIONS
In this paper, we survey the existing literature on Facial Sentiment Analysis, focusing on the DL techniques, the datasets, and the methodologies used to classify emotions. The key contributions of the paper are as follows.
• We present an in-depth survey of FER methods and the datasets used. Then, we highlight the advanced methods used for FER and their comparative analysis.
• We present a taxonomy of FER methods based on face detection, feature extraction, and emotion classification.
• Finally, we present the open issues and research challenges in Facial Sentiment Analysis.

D. ORGANIZATION
The structure of the survey is shown in FIGURE 2. Section II focuses on the evolution of facial recognition techniques presented by authors across the globe and the datasets used. It also describes the need for facial detection, dimension reduction, normalization, feature extraction, and emotion classification. In Section III, we present the bibliometric analysis and the methodology used to conduct the proposed survey. In Section IV, we discuss the various facial expression databases available for analysis. Section V discusses the proposed taxonomy for facial sentiment analysis. In Section VI, we discuss the open issues and research challenges of FER, and finally, Section VII concludes the survey. Table 1 lists all the acronyms used in the paper.

II. BACKGROUND
This section focuses on the background and importance of facial expressions for sentiment analysis. It is divided into four subsections. Firstly, we discuss the evolution timeline of facial recognition methods. Secondly, we discuss the need for facial detection, dimension reduction, and normalization for sentiment analysis. In the third subsection, we focus on the need for feature extraction from the face image. Finally, we highlight the need for emotion classification.

A. EVOLUTION TIMELINE
FIGURE 3 gives a brief overview of the evolutionary timeline of facial sentiment recognition methods proposed by researchers across the globe, along with the datasets. Various algorithms exist for FER, including traditional state-of-the-art algorithms and DL-based algorithms proposed by various researchers until 2020. Emotion recognition was first described by Bassili [30] in 1978, where the emotions were classified into six basic categories: happiness, sadness, fear, surprise, anger, and disgust. Different algorithms (traditional and DL) have been used for FER. For example, an ANN was used for the first time by Padgett and Cottrell [31] in 1996, followed by SVM [32] in 2000, CNN [33] in 2003, Multi-SVM [34] in 2006, boosted DBN [34] in 2014, RNN [35] in 2015, and (PHRNN and MSCNN) [36]   these FER models. The timeline lists the datasets by creation year: JAFFE [37] in 1998, CK+ [38] in 2000, MMI [39] in 2002, Oulu-CASIA [40] in 2008, Multi-PIE [41] in 2009, RaFD [42], MUG [43], and TFD [44] in 2010, and FER-2013 [45] in 2013. New datasets continue to become available and, supported by DL algorithms, help to solve the challenges in FER.

B. NEED FOR FACIAL DETECTION, DIMENSION REDUCTION AND NORMALIZATION
In the FER process, the first prerequisite step is face detection, which detects a face in the image or frame and removes the insignificant pixels. The face detection algorithm outputs the coordinates of the bounding box that is placed over the face. Detecting a face is quite complex, as human faces come in different sizes and shapes, so the face detection algorithm plays a vital role. Various algorithms for face detection are available, such as Viola-Jones [46], PCA, LDA, and genetic algorithms. The Viola-Jones algorithm is one of the most widely used algorithms for face detection. It differentiates faces from non-faces. PCA is the other most widely used face detection method. It is used to reduce the image dimensions and has four main parts: feature covariance, eigen decomposition, principal component transformation, and choosing components [47]. Reducing the data from m dimensions to n dimensions (m > n) does not mean that the properties of the image are lost; rather, they are preserved [48]. After the dimension reduction, normalization can be used to scale the image.
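The four parts of PCA listed above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the procedure of [47]: the function name `pca_reduce`, the synthetic data, and the choice of two components are all hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Reduce rows of X (samples x m features) to n_components dimensions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # 1) Feature covariance: an (m x m) matrix of the centred features.
    cov = np.cov(Xc, rowvar=False)
    # 2) Eigen decomposition (eigh is suitable because cov is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3) Choosing components: keep the directions with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 4) Principal component transformation: project onto the kept directions.
    Z = Xc @ components
    explained = eigvals[order].sum() / eigvals.sum()
    return Z, components, mean, explained

# Synthetic data (an assumption for illustration): 200 samples with m = 5
# features that lie close to a 2-D plane, plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

Z, components, mean, explained = pca_reduce(X, n_components=2)
X_rec = Z @ components.T + mean  # approximate reconstruction from n = 2 dims
```

In this toy setting the two retained components explain almost all of the variance, which is the sense in which reducing m dimensions to n can preserve the properties of the data.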

C. NEED FOR FEATURE EXTRACTION
Facial expression analysis comprises various methods such as facial landmark identification, feature extraction, and different feature extraction databases. Facial landmarks are drawn from the facial key points, which are derived from the geometry of the face [16]. Feature extraction is done after the pre-processing phase [49]. Two methods are available for feature extraction: appearance-based extraction and geometric-based extraction. Neha et al. [50] analyzed the performance of the Gabor filter as a feature extraction technique. They also tested the average Gabor filter and compared both filtering techniques to enhance the recognition rate. The geometric-based method extracts features such as edge features and corner features.
• Corners: The corners of an image are a significant property, which can be inferred from the complex objects of the image. Cho et al. [51] developed a corner detection technique that measures the distance and angle between two straight lines.
• Edges: They are one-dimensional features that represent the boundary of an image region.
The second method, the appearance-based method, considers the states of different points of the face, such as the position of the eyes and the shape of important regions such as the mouth and eyebrows, using salient point features.
The majority of the traditional methods have used Local Binary Pattern (LBP) as the feature extraction technique, which is a generic framework for extracting features from a static image. It converts the most important features of the input image, as mentioned above, into a histogram [52].

D. NEED FOR EMOTION CLASSIFICATION
The third step in FER is emotion classification. Various methods are used to classify emotions after the face detection and feature extraction algorithms have been applied. The classification algorithms include the convolutional neural network (CNN) [53], SVM, and the restricted Boltzmann machine (RBM). The most widely used method for classification is CNN. It is considered the most efficient algorithm, as it can be applied directly to the input image without any separate feature extraction or face detection algorithm and still achieves better accuracy on the input data [54]. The number of images in the training data set also has a huge impact

III. SURVEY METHODOLOGY
In this section, we present the methodology followed to conduct the proposed survey, including the research questions, the authentic data sources, and the search strings used.

A. RESEARCH PLANNING
The proposed survey started with the discussion and identification of quality research questions, data sources, and search criteria. We identified the relevant surveys proposed by various researchers and, where the data was found to be relevant, extracted it [55].

B. RESEARCH QUESTIONS
The proposed survey examines the existing literature on Facial Sentiment Analysis. The identified research questions are specified in Table 3.

C. DATA SOURCES
A wide range of literature was studied for a thorough and complete survey. We used reputable digital databases such as Springer, IEEE Xplore (early access, magazine, and transaction articles), ScienceDirect (Elsevier), the ACM Digital Library, and Google Scholar to access the existing literature surveys on Facial Sentiment Analysis [56].

D. SEARCH CRITERIA
The search was performed using standard keywords like "Facial Sentiment Analysis", "Facial Emotion Recognition", and other matching keywords, as shown in FIGURE 4.
Many articles in the different digital libraries do not contain the search string in either the title or the abstract [57]; a manual search was carried out for such research papers.

IV. FACIAL EXPRESSION DATABASES
In this section, we discuss various datasets that are currently used for training and testing in FER. We also present a comparative analysis of these datasets based on the articles published to date. The comparative analysis of the datasets is presented in Table 4.

A. EXTENDED COHN-KANADE (CK+)
CK+ is the most widely used dataset for FER systems [7], [38]. It contains 593 sequences captured from 123 different subjects. The sequences vary from 10 to 50 frames, and the frames show the shift from a neutral face to a specific expression [7].

B. THE JAPANESE FEMALE FACIAL EXPRESSION
The JAFFE dataset includes 219 different images with 7 facial expressions captured from 10 Japanese female models [73]. The images of each model were captured while she looked at a camera through a semi-reflective plastic sheet. Tungsten lights were used to avoid poor illumination of the face.

C. RADBOUD FACES DATABASE (RAFD)
The RaFD database contains portrait images of 49 subjects: 39 Dutch adults and 10 Dutch children [42]. All models show 8 facial emotions (neutral, sadness, happiness, disgust, anger, contempt, surprise, and fear) with three gaze directions. Every emotion was shown with the eyes directed straight ahead, averted to the left, and averted to the right. The pictures were captured against a white background from five different camera angles at the same time, from left to right in steps of 45°. There are a total of 120 images for each model.
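The per-model image count follows directly from the three factors described above, assuming that every combination of emotion, gaze direction, and camera angle is recorded exactly once:

```latex
\underbrace{8}_{\text{emotions}} \times \underbrace{3}_{\text{gaze directions}} \times \underbrace{5}_{\text{camera angles}} = 120 \ \text{images per model}
```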

D. DELIBERATE EXPRESSION DATASET MMI
The MMI database is a laboratory-controlled database that contains 326 sequences from 32 subjects [39], [74]. A total of 213 sequences are labeled with 6 expressions, and the main advantage of this database is that 205 sequences are captured in frontal view. Unlike CK+, the sequences in this database start with a neutral expression, reach the specific expression in the middle of the sequence, and then return to the neutral expression. It is a complex database because of large variations between images of the same expression and non-uniformity (many of the images include glasses, long hair, or a mustache).
Researchers widely use the first frame and the last three frames leading to the final expression to perform the emotion recognition task [7].

E. FER2013
This database was introduced during the ICML 2013 challenges on Kaggle [7], [45]. It includes a large number of images with important characteristics, including unconstrained images collected automatically with the Google image search API. It contains 28,709 images in the training set, 3,589 images in the validation set, and 3,589 images in the test set, for a total of 35,887 images with 7 expression labels [7]. All these images have been resized to 48 × 48 pixels. This is one of the most challenging datasets in FER.

F. TORONTO FACE DATABASE (TFD)
It is a combination of different FER datasets [44]. It includes 112,234 images, 4,178 of which are labeled with one of seven expression labels: sadness, surprise, fear, happiness, anger, disgust, and neutral [7]. The main advantage of this dataset is that the faces in its images have already been detected and resized to 48 × 48 pixels. The database has 5 folds, each of which assigns 70% of the images to the training set, 10% to the validation set, and the remaining 20% to the test set [7].

G. MULTI-PIE
This database contains 755,370 images of 337 subjects [41]. Its main advantage is that it includes images from 15 different viewpoints under 19 diverse illumination conditions, and each image is labeled with one of 6 expressions. It is mainly used for multi-view 3D FER [7].

H. STATIC FACIAL EXPRESSIONS IN THE WILD (SFEW)
It was designed by selecting static frames from the Acted Facial Expressions in the Wild (AFEW) database [67]. The advanced SFEW 2.0 was the benchmarking data used for the EmotiW 2015 challenge. It includes 958 images for training, 436 images for validation, and 372 images for testing, labeled with 7 expression labels [7].

I. OULU-CASIA
The Oulu-CASIA NIR (near-infrared) and VIS (visible light) facial expression database covers six diverse expressions (surprise, happiness, sadness, anger, fear, and disgust) from 80 subjects between the ages of 23 and 58 years [40]. 73.8% of the subjects are male and the rest are female. The image resolution is 320 × 240 pixels.

J. MULTIMEDIA UNDERSTANDING GROUP
It is also known as the MUG dataset [43], a laboratory-controlled dataset with the 6 basic emotions (anger, disgust, fear, happiness, sadness, and surprise) plus neutral. This database is divided into two parts. In the first part, the subjects were asked to perform the six basic emotions. The second part contains laboratory-induced emotions. There are a total of 86 subjects and 1462 image sequences. The camera captured pictures at a rate of 19 frames/second. Each picture is in jpg format with 896 × 896 pixels and a size of 240 to 340 KB. This dataset was created to overcome the limitations of other comparable FER datasets by providing high resolution, uniform lighting, many subjects, and many takes per subject.

K. AFFECTNET
It is a dataset for the identification of human facial expressions in the wild. It contains more than 1 million images collected from the Internet using over 1200 keywords related to human emotions.

L. EMOTIC
It is a facial emotion dataset whose name stands for EMOTions In Context. It collects images of people in real environments with their apparent emotions. It covers a wide set of 26 emotion categories.

V. FACIAL SENTIMENT ANALYSIS: THE PROPOSED TAXONOMY
This section presents the taxonomy for Facial Sentiment Analysis, which is split into three subsections: pre-processing (face detection, dimension reduction, and normalization), feature extraction, and emotion classification. The detailed taxonomy for FER is shown in FIGURE 5.

A. FACIAL DETECTION, DIMENSION REDUCTION, AND NORMALIZATION
In this section, we discuss the various face detection, dimension reduction, and normalization techniques that are widely used in FER models. We also highlight and compare the various face detection techniques proposed by different researchers. Table 5 shows the comparison of various state-of-the-art detection techniques available in the existing literature.

1) VIOLA-JONES
• Haar Features: There are two regions, as shown in the figure, a black shaded region and a white shaded region [83]. The feature value is calculated using the difference between those two regions for linear features [83].
• Integral Image: As the dimension of the generated Haar features is large, a technique called the integral image (also referred to as the integral graph or integral map) can be used, which works on the 2D coordinates of the gray-scale picture and the value of every pixel point [83]. To build an integral image, each pixel is set to the sum of all pixels above and to the left of it. From these values, the sum of all pixels inside a rectangle can then be determined.
• AdaBoost Training: This training algorithm starts from weak classifiers and trains them repeatedly to obtain a strong one. Two things need to be considered for the classification of an image. First, the region of the eyes is darker than the region of the nose and the cheeks. Second, the eyes are darker than the bridge of the nose [75].
• Cascading Classifiers: We can classify an image as a face or not a face using a single classifier, but the result might not meet our expectations. So, Viola-Jones uses cascading classifiers to make this classification accurate. The stages of the cascading classifier are shown in FIGURE 7. If one of the stages fails, the image is not a face, but if it successfully passes the last classifier, the image is classified as a face and stored in the corresponding database.
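The integral image and the two-region feature described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation of [83]: the function names, the zero-padded layout of the table, and the sign convention (white region minus black region, with white on the left) are assumptions.

```python
import numpy as np

def integral_image(gray):
    """Summed-area table with an extra zero row and column.

    ii[y, x] holds the sum of gray[:y, :x], i.e. of all pixels strictly
    above and to the left of position (y, x).
    """
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using only four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def haar_two_rect_horizontal(ii, top, left, height, width):
    """Two-rectangle feature: sum of the left (white) half minus the
    sum of the right (black) half. Assumes an even width."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black

# Hypothetical 6 x 6 gray-scale patch used only to exercise the functions.
gray = np.arange(36, dtype=np.uint8).reshape(6, 6)
ii = integral_image(gray)

# A patch that is bright on the left and dark on the right.
edge = np.zeros((4, 4), dtype=np.uint8)
edge[:, :2] = 10
edge_feature = haar_two_rect_horizontal(integral_image(edge), 0, 0, 4, 4)
```

Because every rectangle sum costs four lookups regardless of its size, a large number of Haar features can be evaluated cheaply once the table has been built.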
Viola-Jones is not capable of handling occlusion and rigid objects, so the algorithm might generate false face detections. To overcome this issue, the authors in [83] proposed a modified Viola-Jones algorithm using composite features. The procedure to detect the face using the modified Viola-Jones algorithm is as follows.
• A rectangular frame of the face is determined using Viola-Jones.
• The face in the rectangular frame is then calibrated and processed into four kinds of sub-images.
• The features are extracted from the acquired face and four sub-images using Zero/Null Space Linear Discriminant Analysis (NLDA).
• Then, the extracted features are evaluated by using discriminant distance.
• Now new composite feature vectors are generated from the discriminant values and then they are fed to a classifier for face recognition.
2) PRINCIPAL COMPONENT ANALYSIS
The fundamental idea behind PCA is that multi-attribute data is projected onto a lower-dimensional linear space. This subspace is known as the principal subspace. The human face can be recognized using eigenfaces. The eigenspace (the basis of faces) is a set of eigenvectors of the covariance matrix of the face space, and faces are classified according to their representation in this basis. The steps to create the eigenfaces are described as follows.
• Provide a training set of faces with a pixel resolution of w × h.
• Then the mean is calculated and subtracted from each image in the matrix.
Weight of the k-th eigenface is
The size becomes (40 × m × n) after filtering, but after downsampling it is reduced to (10 × m × n), i.e., the dimension is reduced by a factor of 4. Later, Luo et al. [85] used PCA to extract the important global features from an image, but these can be sensitive to the environment for facial expressions. To overcome this issue, some local features are also selected using LBP.
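The eigenface construction can be sketched as follows. Since the weight formula is not recoverable from the text above, the sketch assumes the commonly used textbook form, in which the weight of the k-th eigenface is the projection w_k = u_k^T (x − mean face). The function names, the use of the small N × N matrix instead of the full covariance matrix, and the random stand-in "faces" are assumptions for illustration.

```python
import numpy as np

def fit_eigenfaces(faces, n_components):
    """faces: (N, w*h) array of flattened training faces.

    n_components should be at most N - 1, because centring removes one
    degree of freedom from the training set.
    """
    mean_face = faces.mean(axis=0)
    A = faces - mean_face                      # mean-subtracted faces, (N, D)
    # Eigenvectors of the small (N x N) matrix A A^T map to eigenvectors of
    # the large (D x D) covariance matrix A^T A via multiplication by A^T.
    small = A @ A.T
    eigvals, eigvecs = np.linalg.eigh(small)
    order = np.argsort(eigvals)[::-1][:n_components]
    U = A.T @ eigvecs[:, order]                # (D, n_components)
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    return mean_face, U

def project(face, mean_face, U):
    """Weights w_k = u_k^T (face - mean_face), one per eigenface."""
    return U.T @ (face - mean_face)

# Stand-in data: 10 random "faces" with a resolution of w x h = 8 x 8.
rng = np.random.default_rng(0)
faces = rng.random((10, 64))

mean_face, U = fit_eigenfaces(faces, n_components=9)
weights = project(faces[0], mean_face, U)
reconstruction = mean_face + U @ weights
```

A new face would typically be classified by comparing its weight vector with the weight vectors of the training faces, for example by nearest neighbour in this low-dimensional space.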

3) LINEAR DISCRIMINANT ANALYSIS
LDA is similar to PCA in that it is used to reduce the dimensions of the given data. Like the eigenface in PCA, LDA uses the Fisherface (an enhancement of the eigenface) to reduce the dimensions of the features and to identify the face in an image. The Fisherface is usually used when the images vary in illumination. The steps to create the Fisherface are the same as for PCA, and the Fisherface approach performs better than PCA [86].
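The difference from PCA is that LDA uses the class labels. The sketch below follows the textbook Fisher criterion (maximize between-class scatter relative to within-class scatter), which is an assumption here and not a procedure given in [86]. The function name, the pseudo-inverse, and the synthetic two-class data are illustrative; Fisherface pipelines usually apply PCA first so that the within-class scatter matrix is invertible.

```python
import numpy as np

def fit_lda(X, y, n_components):
    """Return projection directions that separate the classes in y."""
    classes = np.unique(y)
    mean_total = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_total)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Directions are the leading eigenvectors of Sw^{-1} Sb; the
    # pseudo-inverse guards against a singular Sw.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real

# Synthetic data: two classes separated along the first axis, but with a
# much larger (class-independent) spread along the second axis.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=[0.3, 2.0], size=(100, 2))
X1 = rng.normal(loc=[3.0, 0.0], scale=[0.3, 2.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

W = fit_lda(X, y, n_components=1)
proj = X @ W[:, 0]
```

In this toy example PCA would pick the second axis (largest variance), whereas the LDA direction stays close to the first axis, along which the classes actually differ.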

B. FEATURE EXTRACTION
This section discusses the various feature extraction techniques and explains how they can be used in FER models. We also compare the various feature extraction techniques and analyze the works that use them, as shown in Table 6.

1) LOCAL BINARY PATTERN
It is a texture descriptor that replaces the original pixel values of the image with decimal values, known as LBP codes [97]-[99]. The labels are formed by thresholding the 3 × 3 neighborhood of each pixel against the central value and reading the result as a binary number. However, the basic LBP is limited in capturing large-scale structures and is highly sensitive to noise [100], [101]. It is also not invariant to rotations, and the size of the features increases exponentially with the number of neighbors.
So, various extensions of the LBP were proposed for neighborhoods of any size. A circular neighborhood was proposed with any number of pixels and any radius, represented by the notation (P, R), where P is the number of sampling points on a circle of radius R. A uniform LBP was also proposed, which considers the uniform (U) patterns, i.e., those with at most 2 bitwise transitions between 0 and 1. LBP is used in various applications, such as texture analysis, face analysis, and classification. It codifies the local primitives, including edges, corners, spots, and flat areas. Nowadays, LBP converts the important pixels of an image into a histogram, which is known as the Histogram of Oriented Gradients (HoG) approach and stores the information of the local micro-patterns of the faces.
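The basic 3 × 3 LBP code, the uniformity test, and the code histogram can be sketched as follows. The neighbor ordering, the `>=` comparison against the central pixel, and the function names are assumptions; implementations differ in these conventions.

```python
import numpy as np

# Neighbour offsets (dy, dx), read clockwise from the top-left pixel.
# The starting point and direction are an assumed convention.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(gray):
    """Basic LBP: threshold the 8 neighbours of each pixel against the centre."""
    g = gray.astype(np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes

def is_uniform(code, bits=8):
    """True if the circular bit pattern has at most 2 transitions (0->1 or 1->0)."""
    pattern = [(code >> i) & 1 for i in range(bits)]
    transitions = sum(pattern[i] != pattern[(i + 1) % bits] for i in range(bits))
    return transitions <= 2

def lbp_histogram(gray):
    """256-bin histogram of LBP codes, usable as a texture feature vector."""
    return np.bincount(lbp_codes(gray).ravel(), minlength=256)

# Hypothetical inputs: a flat patch and a patch with a single bright centre.
flat = np.full((5, 5), 50, dtype=np.uint8)
peak = np.zeros((3, 3), dtype=np.uint8)
peak[1, 1] = 10

hist = lbp_histogram(flat)
n_uniform = sum(is_uniform(c) for c in range(256))
```

In practice the histogram is often computed per image region and the regional histograms are concatenated, so that the descriptor retains some spatial layout of the face.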

VOLUME 8, 2020
A modified algorithm for LBP was proposed by the authors of [89].The steps for generating the threshold are as follows.
• The input preprocessed image is divided into 3 × 3 blocks.
• For each block, calculate the minimum and maximum of block representing the pixel intensity value of the block.
• Now, calculate the threshold value of block B by taking an average of both the minimum and maximum values.
• If any element of the block is greater than threshold, then write '1' to it, else write '0'.
• The eight-bit pattern is converted to a decimal number, representing transformed block B.
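The thresholding steps listed above can be sketched as follows. The text does not say which eight of the nine block elements form the eight-bit pattern or in which order they are read, so the sketch assumes the eight outer elements read clockwise from the top-left corner, with the first one as the most significant bit. These choices and the function names are assumptions, not details from [89].

```python
import numpy as np

# Outer elements of a 3 x 3 block, clockwise from the top-left corner.
# Reading order and bit significance are assumptions (see lead-in).
NEIGHBOUR_INDEX = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def block_code(block):
    """Encode one 3 x 3 block using the (min + max) / 2 threshold."""
    block = np.asarray(block, dtype=np.float64)
    threshold = (block.min() + block.max()) / 2.0
    code = 0
    for r, c in NEIGHBOUR_INDEX:
        bit = 1 if block[r, c] > threshold else 0
        code = (code << 1) | bit     # first element ends up as the MSB
    return code

def modified_lbp(gray):
    """Apply block_code to non-overlapping 3 x 3 blocks of the image.

    Trailing rows or columns that do not fill a whole block are ignored.
    """
    h, w = gray.shape
    out = np.zeros((h // 3, w // 3), dtype=np.uint8)
    for i in range(h // 3):
        for j in range(w // 3):
            out[i, j] = block_code(gray[3 * i:3 * i + 3, 3 * j:3 * j + 3])
    return out

# Hypothetical block: min = 10, max = 200, so the threshold is 105.
example = np.array([[10, 200, 10],
                    [10, 100, 200],
                    [10, 10, 10]], dtype=np.uint8)
example_code = block_code(example)   # bits 0,1,0,1,0,0,0,0
```

Compared with the basic LBP, the threshold here depends on the intensity range of the whole block rather than on the central pixel alone.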

2) GABOR FILTER
It extracts information from both the time (spatial) and frequency domains [102] of the image. That is, it analyzes whether there is any particular frequency content in the image in a particular direction around the point of analysis. The 2D Gabor filter is used in the spatial domain. Gabor filters are quite successful in FER models. Multi-resolution structures, which consist of multiple frequencies and multiple orientations, are applied to images. These structures relate Gabor filters to wavelets [103].
The real and imaginary components of the filter represent orthogonal directions. In the equations given in [102], (a, b) is the pixel position in the spatial domain, θ is the orientation of the Gabor filter, ω is the radial central frequency, and σ is the standard deviation of the Gaussian filter, which controls the size of the Gabor envelope. Liu et al. [87] proposed the local Gabor filter bank LG(m × n), which spreads all over. It also contains the multi-scale feature information of a global filter bank for the image, and it reduces the redundancy in the eigenvalues. This reduces the time needed to extract the features.
The authors in [84] used a Gabor filter bank with eight orientations and five scales. The resulting bank filters each of the divided images 40 times. This created a computational burden for them, so they reduced the number of features using dimension reduction techniques. Dimension reduction using PCA is explained in the PCA subsection.
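A bank with five scales and eight orientations (40 filters) can be sketched as follows. The kernel uses one common form of the 2D Gabor filter, a Gaussian envelope of width σ multiplied by a cosine (real part) or sine (imaginary part) wave of frequency ω along the direction θ. This form, the frequency spacing between scales, the relation σ = π/ω, and the kernel size are assumptions and may differ from the equations of [102] and the settings of [84].

```python
import numpy as np

def gabor_kernel(size, theta, omega, sigma):
    """Real and imaginary parts of a 2D Gabor kernel (size must be odd).

    (a, b) is the pixel position, theta the orientation, omega the radial
    central frequency, and sigma the width of the Gaussian envelope.
    """
    half = size // 2
    b, a = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    a_rot = a * np.cos(theta) + b * np.sin(theta)   # coordinate along theta
    envelope = np.exp(-(a ** 2 + b ** 2) / (2.0 * sigma ** 2))
    real = envelope * np.cos(omega * a_rot)
    imag = envelope * np.sin(omega * a_rot)
    return real, imag

def gabor_bank(size=15, n_scales=5, n_orientations=8):
    """List of (real, imag) kernels for every scale/orientation pair."""
    bank = []
    for s in range(n_scales):
        omega = (np.pi / 2.0) / (np.sqrt(2.0) ** s)  # assumed frequency spacing
        sigma = np.pi / omega                        # assumed envelope width
        for o in range(n_orientations):
            theta = o * np.pi / n_orientations
            bank.append(gabor_kernel(size, theta, omega, sigma))
    return bank

bank = gabor_bank()
real0, imag0 = bank[0]
```

Convolving an image with all 40 kernels yields 40 response maps per image, which is the source of the computational burden and the motivation for the subsequent dimension reduction.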

3) DISCRETE COSINE TRANSFORM (DCT)
DCT represents a finite sequence of data or feature points as a sum of cosine functions oscillating at different frequencies. It is a way of compressing data or a 2D image without losing its original meaning [104]. In such data compression applications, an input of size 8 × 8 is used for DCT [105] for feature extraction. It has two stages:
• The first stage is to apply DCT to the image.
• The second stage is the selection of coefficients [106]. Applying DCT to a U × V image yields a 2D U × V coefficient matrix.
In the DCT of a sequence of M points, G_x(k) denotes the kth DCT coefficient, with k = 1, 2, . . ., (M − 1) [107]. Jayalekshmi et al. [75] used DCT over fixed discrete sequences to convert the data into elementary frequency components.
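The two stages can be sketched as follows. This is a minimal illustration that assumes the orthonormal DCT-II, whose scaling may differ from the definition in [107], and a simple selection rule that keeps the top-left (low-frequency) block of coefficients; the helper names and the `keep` parameter are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k is a cosine at the k-th frequency."""
    k = np.arange(n).reshape(-1, 1)
    m = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos((2 * m + 1) * k * np.pi / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has its own scale factor
    return c

def dct2(image):
    """Stage 1: 2D DCT of a U x V image, giving a U x V coefficient matrix."""
    u, v = image.shape
    return dct_matrix(u) @ image @ dct_matrix(v).T

def low_freq_features(image, keep=4):
    """Stage 2 (assumed rule): keep the keep x keep low-frequency coefficients."""
    return dct2(image)[:keep, :keep].ravel()

block = np.arange(64, dtype=float).reshape(8, 8)
coeffs = dct2(block)
features = low_freq_features(block, keep=4)
```

Because the basis is orthonormal, the image can be recovered exactly from the full coefficient matrix, so the information loss comes only from the selection stage.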

4) SCALE INVARIANT FEATURE TRANSFORM (SIFT)
SIFT transforms the input image data into scale-invariant coordinates relative to the local features, which are stored in a database [108]. SIFT features are highly distinctive, i.e., a single feature can be matched with high probability against the database. SIFT features are also scale and rotation invariant, which means that the features are preserved even if the image is scaled or rotated. This is useful in FER when a rotated image arrives as input, since the features can still be extracted efficiently [109]. The stages of the SIFT procedure are as follows.
• Scale-Space Extrema Detection: The first stage searches over all image locations and scales using the Gaussian method.
• Keypoint Localization: It determines the location and scale of each candidate keypoint in the image.
• Orientation Assignment: Rotation invariance is achieved by assigning one or more dominant orientations to each keypoint.
• Keypoint Descriptor: A descriptor is built to represent each keypoint, relative to the orientation assigned in the preceding stage. It is based on the histogram of gradients within the image, and it reduces the effect of illumination changes on the keypoint.
Once computed, the keypoint descriptors can be used as features to solve various problems. More detailed information on SIFT computation can be found in [110], [111].
Kravets et al. [90] proposed the P-SIFT (Parallel SIFT) algorithm, which reduces the computation time and increases the processing speed. In P-SIFT, the problem is divided into subtasks, and multiple processors are used for feature extraction. The program reads the input image and generates the key points. It then matches the key points against each image in the database using the Euclidean distance. Images for which the ratio of the smallest distance to the second-smallest distance is less than 0.8 are taken into consideration. The image is then given to the classifier, which classifies the emotion [112].
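The distance-ratio matching step described above can be sketched in a few lines. This is an illustrative example operating on already computed descriptor vectors; the function name and the toy 2D descriptors are assumptions (real SIFT descriptors have 128 dimensions), and the database must hold at least two descriptors.

```python
import numpy as np

def ratio_test_matches(query_desc, db_desc, ratio=0.8):
    """Return (query_index, db_index) pairs that pass the distance-ratio test."""
    matches = []
    for i, q in enumerate(query_desc):
        dists = np.linalg.norm(db_desc - q, axis=1)  # Euclidean distances
        first, second = np.argsort(dists)[:2]        # two nearest neighbours
        # Accept only if the best match is clearly better than the runner-up.
        if dists[first] < ratio * dists[second]:
            matches.append((i, int(first)))
    return matches

db = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
query = np.array([[0.5, 0.0], [5.0, 5.0]])
matches = ratio_test_matches(query, db)
```

Here the first query descriptor lies close to a single database entry and is matched, while the second is equally far from all entries and is rejected as ambiguous.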
A newly adapted SIFT-based method for extracting features from an image, called IntraFace, was proposed in [91]. It uses multi-directional warping of an active appearance model and a supervised descent method [113], which uses SIFT feature extraction for feature mapping and is trained to extract 49 points from the image. These points are used for registering an average face, which is then termed the face region.

5) SPEEDED UP ROBUST FEATURES (SURF)
Speeded Up Robust Features (SURF) is a local feature detector as well as a descriptor. It is inspired by SIFT, and its authors claim that it is faster than SIFT. Around the interest point, a reproducible orientation is fixed based on information from a circular region. A square region is then aligned to this orientation, and the SURF descriptor is extracted from it [114]. SURF has two parts: (i) a detector and (ii) a descriptor. The detector provides the locations of the key points of the image, and the descriptor expresses the features of those key points. SURF uses the Hessian matrix for detection and integral images for the fast evaluation of box filters. Because of the box filters, SURF introduces a pyramid scale space. The descriptor uses sums of Haar wavelet responses, which increases the robustness and decreases the computation time. For extraction, it constructs a square region around the keypoint, oriented along the selected orientation. Each region is split into 4 × 4 sub-regions, which preserves the important spatial information. The Haar wavelet responses are computed at 5 × 5 sampled points [88].
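The reason box filters are cheap in SURF is the integral image, in which each entry holds the sum of all pixels above and to the left of it, so that the sum over any rectangle needs at most four lookups. A minimal sketch, using this standard definition and hypothetical helper names:

```python
import numpy as np

def integral_image(image):
    """Entry (y, x) holds the sum of image[0:y+1, 0:x+1]."""
    return image.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of image[top:bottom+1, left:right+1] using at most four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

image = np.arange(25, dtype=np.int64).reshape(5, 5)
ii = integral_image(image)
```

The cost of `box_sum` does not depend on the size of the rectangle, which is what allows SURF to evaluate large box filters at every scale at constant cost.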

6) HISTOGRAM OF ORIENTED GRADIENTS (HOG)
The features extracted by HoG are robust to photometric and geometric deviations. Many applications, such as human detection [110], use HoG as the feature extraction technique.
The steps to calculate these features are:
• First, the entire image is divided into small cells.
• Then, the gradient direction and magnitude are calculated.
• The histogram bin is computed for each direction and magnitude.
• Blocks are formed from adjacent cells, and block normalization is applied to compute the feature vector.
During implementation, histograms of 9 bins were calculated using cells of 8 × 8 pixels and blocks of 2 × 2 cells with unsigned orientation. Reference [115] used a fusion of HoG and LBP features. First, the HoG and LBP features were extracted from the segmented parts. The final feature vector combined both and had 1892 features, of which 1656 came from HoG and the rest from LBP.
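The steps above can be sketched with NumPy as follows. This is a simplified illustration, not the implementation of [115]: it assumes hard assignment of each pixel to a single bin (no interpolation between bins), blocks that slide by one cell, and L2 block normalization; the function name and default parameters are hypothetical.

```python
import numpy as np

def hog_features(image, cell=8, block=2, bins=9):
    """Simplified HoG: per-cell orientation histograms, L2-normalized per block."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    n_cy, n_cx = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            rows = slice(cy * cell, (cy + 1) * cell)
            cols = slice(cx * cell, (cx + 1) * cell)
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[cy, cx] = np.bincount(bin_idx[rows, cols].ravel(),
                                       weights=magnitude[rows, cols].ravel(),
                                       minlength=bins)

    blocks = []
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            v = hist[by:by + block, bx:bx + block].ravel()
            blocks.append(v / np.sqrt(np.sum(v**2) + 1e-6))  # block normalization
    return np.concatenate(blocks)

rng = np.random.default_rng(0)
features = hog_features(rng.random((64, 64)))
```

For a 64 × 64 input this yields 7 × 7 blocks of 2 × 2 cells with 9 bins each, i.e., 1764 features; the feature count therefore depends directly on the image size and the cell and block settings.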

C. EMOTION CLASSIFICATION
In this section, we discuss the various classification techniques that are used to classify human facial emotions. We also present a comparative analysis of various emotion classification techniques, as shown in Table 7.

1) CONVOLUTIONAL NEURAL NETWORK (CNN)
CNN is one of the most widely used architectures in computer vision as well as in machine learning [130]. A massive amount of data is required for training in order to harness its ability to solve complex functions to the fullest [131]. In contrast to the conventional fully connected deep neural network, CNN uses convolutional, min-max pooling, and fully connected layers [53], [132], [133]. When all these layers are stacked together, the complete architecture is formed, as shown in FIGURE 8.
• The input layer of CNN contains the image pixel values.
• The convolutional layer convolves the l × l kernels with the m feature maps of its preceding layer. If the next layer has n feature maps, then n × m convolutions are performed and n × m × (w × h × l × l) Multiply-Accumulate (MAC) operations are needed, where h and w represent the feature map height and width of the next layer [132]. The main function of the convolutional layer is to calculate the output of all the neurons that are connected to the input layer. Activation functions such as ReLU, sigmoid, and tanh are applied element-wise to add non-linearity to the output of each neuron.
• The pooling layer is responsible for achieving spatial invariance by reducing the resolution of the feature map. Each pooling layer corresponds to one feature map of the preceding layer. Two pooling functions are common:
1) Max Pooling: It selects the most active feature in a pooling region of size M × M [134].
2) Average Pooling: It applies a window function u(x, y) to the input data and selects the average value of each pooling region on the preceding layer's feature map [135], [136].
Mostly, 2 × 2 pooling is used without overlapping, which means that the window size M is 2. Large pooling uses M values such as 4, 8, or 16, depending on the input image size. To satisfy the need for a large pooling region, the authors of [136] proposed a multi-activation pooling method. This method allows the top-p activations to pass through the pooling layer, where p is the total number of picked activations. If p = M × M, every activation contributes to the final output of the neuron [136]. For a pooling region X_i, the nth-picked activation is defined for n ∈ [1, p] by removing the previously picked activations from the region. In Eq. 13, the summation symbol denotes the set of elements containing the top (n − 1) activations, not the numerical addition of the activation values. After the top-p activations are obtained, they are summed and multiplied by a hyper-parameter σ ∈ (0, 1), which acts as a constraint factor [136]. In particular, if σ = 1/p, the output is the average value of the top-p activations. The constraint factor σ can be used to adjust the output values [136].
• The fully connected (FC) layer is the last layer of the CNN architecture. It is the most fundamental layer and is widely used in traditional CNN models [137], [138].
As it is the last layer, each of its nodes is directly connected to every node on both sides. As shown in FIGURE 8, all the nodes in the last frame of the pooling layer are converted into a vector and then connected to the first of the fully connected layers. This layer has many parameters and therefore needs more time for training [139], [140].
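The pooling variants discussed for the pooling layer can be compared in a short sketch. It assumes non-overlapping M × M regions and reads multi-activation pooling as "σ times the sum of the top-p activations in each region", which follows the textual description of [136] rather than its exact equations; all function names are hypothetical.

```python
import numpy as np

def pooling_regions(feature_map, m):
    """Split an (H, W) map into non-overlapping m x m regions, flattened last."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % m, :w - w % m]  # drop rows/cols that do not fit
    regions = fm.reshape(h // m, m, w // m, m).swapaxes(1, 2)
    return regions.reshape(h // m, w // m, m * m)

def max_pool(feature_map, m=2):
    return pooling_regions(feature_map, m).max(axis=-1)

def average_pool(feature_map, m=2):
    return pooling_regions(feature_map, m).mean(axis=-1)

def multi_activation_pool(feature_map, m=2, p=2, sigma=None):
    """Sum of the top-p activations of each region, scaled by sigma."""
    if sigma is None:
        sigma = 1.0 / p  # sigma = 1/p gives the mean of the top-p activations
    regions = pooling_regions(feature_map, m)
    top_p = np.sort(regions, axis=-1)[..., -p:]
    return sigma * top_p.sum(axis=-1)

fmap = np.array([[1.0, 2.0, 5.0, 6.0],
                 [3.0, 4.0, 7.0, 8.0],
                 [0.0, 0.0, 1.0, 1.0],
                 [0.0, 9.0, 1.0, 1.0]])
```

With p = 1 and σ = 1 the multi-activation output coincides with max pooling, and with p = M × M and σ = 1/p it coincides with average pooling, so the method interpolates between the two.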
The major limitation of the FC layer is that it contains a large number of parameters, which demand considerable computational power for training. For this reason, the number of connections and nodes in the FC layer is reduced. The removed nodes and connections can be restored again using the technique named dropout.
In the past few years, CNN has become prominent in computer vision, including the field of facial sentiment analysis. Researchers have modified the traditional CNN for better performance. Mollahosseini et al. [91] proposed a modified CNN in which an inception layer was introduced. Their network architecture consisted of two elements. First, it had two traditional CNN modules, each containing a convolutional layer followed by ReLU. Following these modules, they added two inception layers, which consist of 1 × 1, 3 × 3, and 5 × 5 convolutional layers with ReLU in parallel.
Khorrami et al. [120] developed a model based on a classic feed-forward CNN with a slight modification: the biases of the convolutional layers were ignored. The network contained three convolutional layers with 64, 128, and 256 filters, respectively, each of size 5 × 5, followed by the ReLU activation function. They placed max-pooling layers after each of the first two convolutional layers and a quadrant pooling layer after the third. The quadrant pooling layer was followed by an FC layer containing 300 hidden units. Finally, a softmax layer was used for classification.
The idea of using a zero-bias model was first introduced in [141] for the fully connected layers of a CNN and was later extended in [142]. Khorrami et al. evaluated their model on the CK+ and TFD datasets. It recognized emotions with a rate of 88.6% ± 1.5% on TFD with 7 classes and 95.1% ± 3.1% on CK+ with 8 classes. They used data augmentation combined with dropout to boost the performance [120].
To further increase the accuracy and recognition performance of computer vision algorithms, researchers proposed more advanced and deeper CNN architectures. Ding et al. [77] designed a new technique named FaceNet2ExpNet to train the model. They used a fine-tuned face net and proposed a unique distribution function to train the neurons of the expression net. To improve the discriminativeness of the learned features, they used the conventional network to design the expression net. The training process was executed in two stages. In the first stage, the convolutional layers were trained using the loss function, with the output of the last pooling layer used for supervision. In the second stage, they added a randomly initialized FC layer and then trained the network using labeled training data. Testing was done on constrained as well as unconstrained datasets and achieved better results than the previous approaches [77].
In 2018, Jadhav et al. [54] investigated the previous approaches and applied modifications to increase performance. They investigated three popular networks that were quite successful in classifying emotions. The first network was proposed by Krizhevsky and Hinton [143]. The second network was inspired by the AlexNet [138] convolutional network, and the last one was based on the work of Gudi [144]. It consisted of an input layer of 48 × 48, one convolutional layer, a normalization layer, and then a max-pooling layer, followed by two more convolutional layers and finally one FC layer linked to the softmax layer. To decrease the number of parameters, [54] applied one more max-pooling layer. They trained and tested the model on FER2013 and RaFD, respectively, and obtained better results than [144].
Cai et al. [123] introduced a new island loss layer in CNN to minimize intra-class variations. Their network includes three convolutional layers, each followed by a PReLU and a batch normalization (BN) layer. Pooling was used after the first two BN layers. The third convolutional layer was followed by two FC layers, with the island loss calculated at the second FC layer, and finally by the softmax layer. Their architecture was named IL-CNN. They also employed the VGG-16 [145] network as their backbone network. This approach achieved good performance in comparison to the state-of-the-art methods.
In 2014, Simonyan et al. [145] proposed the VGG16 architecture for object recognition and classification tasks. VGG16 has a repetitive structure of convolution, ReLU, and pooling layers. The network architecture of VGG16 is shown in FIGURE 9.
The invention of Residual Networks created a revolution in the image recognition field. Residual Networks (ResNet) [146] is a classic network used as a backbone for many computer vision tasks. ResNet50 is a state-of-the-art convolutional neural network with identity mapping capability, which it obtains through the concept of skip connections. Traditional CNN was successful in FER to some extent, but it had various limitations:
• Selecting appropriate hyper-parameter values for a CNN requires considerable skill and experience [121].
• It used the stochastic gradient descent method, which had trouble scaling to enormous data and learning neural networks with multiple layers, because the gradients tend to shrink, a problem known as ''vanishing gradients'' [121], [147].
• In real-time environments, CNNs for human facial appearance were easily affected by factors such as age, gender, face length, hair, mustache, and ethnicity [148]. As a result, facial expressions had overlapping features, which makes implementation difficult and complex [121], [149].
To overcome these barriers, ensemble learning was applied, because combining the outputs of the base learners helps in more accurate classification and prediction. Dhankhar [124], [150] developed an ensemble model by combining state-of-the-art DL approaches, namely VGG16 and ResNet50. He took the outputs of the second-to-last layers as feature vectors, which represent latent representations of the input image. He then joined these feature vectors and used the result as input to logistic regression models to compute the final emotion prediction [124]. After applying transfer learning, he trained and tested his model on the Karolinska Directed Emotional Faces (KDEF) dataset and obtained better accuracy than the individual models.
Another ensemble network was proposed by Wen et al. [121], who ensembled CNNs with probability-based fusion for FER [151]. For any ensemble method, the multiplicity and diversity of the classifiers are a major concern in achieving comparable performance [152]. The CNN architecture was implemented using ReLU and multiple hidden layers (maxout layers) with random initial values to overcome the problems of stochastic gradient descent. A softmax classifier was used at the end to estimate the probability of a test sample belonging to each class [121]. They trained and tested the model on the FER2013 and CK+ datasets, respectively, and the accuracy was 76.05%, which was better than that of other methods. Thus, it was concluded that the ensemble CNN (ECNN) consistently outperformed the traditional CNN.
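Probability-based fusion of several classifiers can be illustrated with a minimal sketch. It assumes the simplest rule, a (weighted) average of the softmax outputs of the base networks, which is not necessarily the fusion rule used in [121]; the logits below are made-up values for a single sample with three classes.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fuse_probabilities(prob_list, weights=None):
    """Weighted average of per-model class probabilities and the fused labels."""
    probs = np.stack(prob_list)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    fused = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return fused, fused.argmax(axis=1)

# Hypothetical outputs of two base networks for one sample.
probs_a = softmax(np.array([[2.0, 0.0, 0.0]]))  # mildly prefers class 0
probs_b = softmax(np.array([[0.0, 3.0, 0.0]]))  # strongly prefers class 1
fused, labels = fuse_probabilities([probs_a, probs_b])
```

In this toy case the more confident network decides the fused label, which shows how averaging probabilities differs from a simple majority vote over hard labels.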
Alessandro Renda et al. [125] proposed a feed-forward CNN, inspired by Kim et al. [153], in which three convolutional and max-pooling layers with 32, 32, and 64 feature maps, respectively, were placed after the input layer. They used an ensemble learning approach to increase performance. The max-pooling layers use an overlapping kernel of size 3 × 3 with a stride of 2 × 2, which halves the size. They added a dropout layer with a drop probability of 0.15 after the FC hidden layer. An FC layer with almost 1024 neurons was used to yield the 7 emotion classes of the FER2013 dataset. Their model has a network depth of 5 with 2,436,007 trainable parameters. They used ReLU as the non-linear activation function for both the convolutional and FC layers to introduce non-linearity, and the softmax function at the output layer. They applied batch normalization [154] to each convolutional layer as well as to the FC layers. To preserve the data, they used zero-padding in the convolutional layers. They achieved an accuracy of approximately 72.249% with an ensemble of 9 networks.
Stored facial images often show not a single emotion but a mixture of multiple emotions, which causes poor performance in real applications. To solve this problem, Gan et al. [126] designed an approach using a CNN and soft labels, which associate each image with multiple emotions and expressions. They obtained the soft labels using a constructor involving a two-step scheme:
1) The first step is to train a CNN model with hard data labels for supervision and the softmax function for optimization.
2) The second step is to fuse the prediction probabilities of the pre-trained models to obtain soft labels [126].
Their architecture is similar to VGG16; however, the last FC layer is adjusted to produce C-way outputs, where C is the number of emotion classes.
Zadeh et al. [127] proposed a DL model with a CNN layer and two Gabor filters to classify different human sentiments. The model uses the Gabor filter as a feature selection method, as it is commonly applied for texture outlining and responds wherever there is a texture change in the image. These features are then fed to a CNN for the classification of human sentiment. Their model has the following stages: input images, resizing, first Gabor filter, second Gabor filter, CNN layer, and classification of sentiments. They tested their model on the JAFFE dataset and compared the classification results of a simple CNN with those of their model (CNN with two Gabor filters). After training both for 30 epochs, they obtained an accuracy of 91.16% with the simple CNN and 97.16% with their model [127].

2) SUPPORT VECTOR MACHINE (SVM)
SVM is a classifier [155] that was designed for classifying between two classes. If there are more than two classes, then more than one SVM has to be implemented. There are three methods by which SVM can be implemented for more than two classes.
• One versus all: It was proposed in [156]. It constructs k SVM models for training data having k classes. For example, if there are three classes, three SVMs are trained, one for each class [157].
• One versus one: It was introduced in [156]. This method constructs k(k − 1)/2 classifiers, each trained on two classes at a time. In this method, an SVM is trained for every pair of classes to be classified [157].
• Directed Acyclic Graph SVM (DAGSVM): It was proposed in [158]. Its training phase is similar to the one-versus-one method. The testing phase uses a rooted binary DAG having at most k(k − 1)/2 internal nodes and at most k leaves. An advantage of using a DAGSVM is that its generalization can be analyzed [158].
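The bookkeeping of the one-versus-one scheme can be sketched without training any actual SVM. In this illustration, the decisions of the pairwise classifiers are supplied as a dictionary (a hypothetical stand-in for the outputs of trained binary SVMs), and the final class is chosen by majority voting, which is one common way of combining them.

```python
from itertools import combinations
import numpy as np

def one_vs_one_pairs(k):
    """All k(k - 1)/2 unordered class pairs, one binary classifier per pair."""
    return list(combinations(range(k), 2))

def one_vs_one_predict(pairwise_winner, k):
    """pairwise_winner maps each pair (i, j) to the class its binary SVM chose."""
    votes = np.zeros(k, dtype=int)
    for pair in one_vs_one_pairs(k):
        votes[pairwise_winner[pair]] += 1
    return int(votes.argmax())  # class with the most pairwise wins

k = 7  # e.g., seven emotion classes
pairs = one_vs_one_pairs(k)
# Hypothetical decisions: class 3 wins every comparison it takes part in.
winners = {pair: (3 if 3 in pair else min(pair)) for pair in pairs}
predicted = one_vs_one_predict(winners, k)
```

For seven emotion classes this requires 21 binary classifiers, compared with 7 for the one-versus-all scheme, although each one-versus-one classifier is trained on the data of only two classes.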
The aim of SVM is to identify the maximum margin plane between the classes. The maximum margin plane is obtained by maximizing the distance between the positive and the negative margin planes of the two classes.
The distance from the separating plane to the positive margin plane and to the negative margin plane should be equal.
To recognize emotions from facial expressions in a simple and fast manner, Datta et al. [122] presented a classification system that used the concatenation of geometric as well as texture-based features to classify the emotions using SVMs. They used a hierarchical SVM architecture to leverage the benefits of multi-class binary classification. The CK+ dataset was used for classification. They achieved significant improvements in accuracy using hybrid SVM features compared to LBP features.
In 2018, Nuno Lopes et al. [78] presented a classification model for FER in the elderly and also described the differences between FER in the elderly and in other age groups. They used a multi-class Support Vector Machine for classifying the emotions [159]. They proposed two architectures. The first removes wrinkles, nasolabial folds, and other facial features using edge-preserving smoothing techniques. In the second, they introduced an algorithm from a Microsoft API that detects the age of the person. The Lifespan dataset was used to train and test the multi-class SVM. They used 80% of the images to test the accuracy of the SVM and 20% to test the accuracy of the application. They obtained an accuracy of 95.24% in the young age group and 90.32% in the elderly age group.
SVM is a linear classifier that can be applied to linearly separable data. However, SVM can also take high-dimensional data as input, which is non-linear most of the time. In that case, a non-linear mapping function is applied during SVM training, which converts the data into a linearly separable form in a higher dimension. This function is called a kernel function. There are various kernel functions; the authors of [85] used the Radial Basis Function (RBF) kernel together with the one-versus-one approach.
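As an illustration of a kernel function, the following sketch computes the RBF kernel matrix in its common form K(x, y) = exp(−γ‖x − y‖²). The parameter name `gamma` and its value are assumptions made for the example; the setting used in [85] is not specified here.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel matrix between the rows of x and the rows of y."""
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq = (np.sum(x**2, axis=1)[:, None]
          + np.sum(y**2, axis=1)[None, :]
          - 2.0 * x @ y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel(points, points)
```

Each entry of K lies in (0, 1] and decays with distance, so the SVM effectively compares samples by similarity in an implicit higher-dimensional space without computing the mapping explicitly.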
Ibrahim Adeyanju et al. [118] proposed a method that uses four SVM kernels to classify different facial emotions: the Radial Basis, Polynomial, Linear, and Quadratic functions. They evaluated their model with 467 training and 238 test sets to classify 7 emotions. They obtained a maximum average accuracy of 86.4% with the RBF kernel, 99.33% with the Quadratic function, 97.65% with the Polynomial function, and 97.86% with the Linear function.

3) ARTIFICIAL NEURAL NETWORK (ANN)
ANN is inspired by the biological neural networks that constitute the brain [160]. Our brain consists of a very large number of neurons that form a neural network. These neurons are interconnected with each other and process the signals between the brain and the other parts of our body [161]. These links are called synapses. There are approximately 100 billion neurons, each interconnected by thousands or more synapses. In ANN, the signal is a real or binary number, and the output of each neuron is obtained by applying an activation function to the sum of all inputs to that neuron [160]. The connection between two neurons/nodes is called an edge. Each edge has a weight, which describes how strong the signal is. Normally, neurons are aggregated into layers. The first and last layers are the input and output layers, respectively. The in-between layers are called the hidden layers. A simple perceptron model given by W. S. McCulloch and W. Pitts is shown in FIGURE 10.
In this model, n inputs x1, x2, . . ., xn are given to the neuron, weights w1, w2, . . ., wn are assigned to the edges, and θ is the step function applied before calculating the output O. The positive weights correspond to excitatory synapses and the negative weights to inhibitory synapses. The activation function is the sigmoid function. This model does not possess the actual behavior of biological neurons.
Talele et al. [119] proposed a model using the General Regression Neural Network (GRNN), based on ANN, to classify emotions from the image. The proposed model has the following layers. The input layer acts as a feed to the subsequent layers. The pattern layer computes the Euclidean distance and the activation function. The summation layer comprises the numerator and the denominator parts, which are fed to the output layer.
The fundamental principle on which the system works is the joint probability density estimate of the input and the output, in which n is the number of observed samples, σ is the spread parameter, x_i is the ith training vector, and y_i is the corresponding output value. The physical interpretation of the probability estimate is that it assigns a sample probability of width σ to each input and output sample.
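The layer structure of a GRNN can be made concrete with a small sketch. It assumes the standard GRNN prediction rule, a Gaussian-weighted average of the training outputs, ŷ(x) = Σ_i y_i exp(−D_i² / (2σ²)) / Σ_i exp(−D_i² / (2σ²)) with D_i² = ‖x − x_i‖²; the toy data and the value of σ are illustrative and do not come from [119].

```python
import numpy as np

def grnn_predict(x, train_x, train_y, sigma=0.5):
    """GRNN output for one input vector x."""
    # Pattern layer: squared Euclidean distance to every training vector,
    # passed through a Gaussian activation of width sigma.
    d2 = np.sum((train_x - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma**2))
    # Summation layer: numerator and denominator of the weighted average.
    numerator = w @ train_y
    denominator = np.sum(w)
    # Output layer: ratio of the two sums.
    return numerator / denominator

# Toy data: two training vectors with one-hot class targets.
train_x = np.array([[0.0, 0.0], [1.0, 1.0]])
train_y = np.array([[1.0, 0.0], [0.0, 1.0]])
near_first = grnn_predict(np.array([0.1, 0.0]), train_x, train_y)
midpoint = grnn_predict(np.array([0.5, 0.5]), train_x, train_y)
```

With one-hot targets the output can be read as class scores that sum to one. A smaller σ makes the prediction follow the nearest training samples more closely, while a very small σ can make all weights underflow to zero for inputs far from the training data.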

4) DEEP BELIEF NETWORK (DBN)
A DBN is a probabilistic, unsupervised DL algorithm that comprises numerous stochastic latent variables. These latent variables are also called feature detectors. It is a hybrid graphical model in which the two upper layers have undirected connections, while the lower layers have directed connections [162]. A DBN consists of a stack of Restricted Boltzmann Machines (RBMs) or auto-encoders. They represent a data vector. The two most important properties of DBN are:
• It uses a layer-by-layer learning approach that determines how the weights depend on the layer above, in a top-down manner.
• A single bottom-up pass, which begins with an observed data vector and uses the weights of the separate layers, gives the values of the latent variables [163].
A Deep Belief Network is pre-trained using a greedy algorithm, in which each layer is trained in turn in an unsupervised manner. The multi-layer DBN is divided into several RBMs, which are learned sequentially. Pre-training is done for better optimization. Fine-tuning then modifies the features so that the category boundaries are placed correctly [164].
Liu et al. [116] proposed a novel model named Boosted Deep Belief Network to implement FER with enhanced performance. Their framework has three main contributions. First, they built a model that consists of three training stages: feature learning, feature selection, and classifier construction. Second, their work uses a part-based representation rather than the whole facial region as input, which is highly suitable for expression analysis. Finally, they proposed a discriminative DL framework in which multiple DBNs are integrated and a boosting technique is applied. They used an experimental dataset named CK-DB, prepared from the first and last three frames of the well-known CK+ dataset, with a total of 1308 images. The accuracy of the model was found to be 96.7%.
Kurup [128] used a DBN with five layers. The input layer has two nodes. All classes are represented as 4-bit codes. The other layers have 3, 3, and 4 nodes and use a sigmoid activation function. The first layer is trained in an unsupervised manner using contrastive divergence (CD), and then a softmax activation function is applied. At the end of the DBN, a fine-tuning backpropagation procedure was applied. The RBMs are trained layer by layer, and each RBM was trained individually five times. Five-fold cross-validation was used, with the training data divided into 5 groups. With this model, they obtained an accuracy of 98.57% on the CK+ dataset and 98.75% on the MMI database.
For FER, Yadan Lv et al. [117] proposed an approach based on DL. The component detectors were trained with a DBN and adjusted by logistic regression. After that, the parsed component features, including the eyes and mouth, are concentrated for expression recognition. The main contributions of their work are that they were the first to use only facial components to recognize emotion, that they treated every feature of a parsed component equally, and that they parsed the face via DBN so that the images need not be pre-processed before extracting features. In simple words, their approach first detects the face, and then the nose, eyes, and mouth are used for expression recognition. Emotion classification was done by a stacked autoencoder classifier.

5) RECURRENT NEURAL NETWORK (RNN)
RNNs are a variation on basic neural networks. They can take a series of inputs with no predetermined limit on its length. They remember prior inputs while generating outputs and make decisions based on past learning. An RNN takes one or more input vectors and produces output vectors that are influenced by hidden state vectors, which depend on earlier inputs and outputs, as shown in FIGURE 11. RNNs provide an effective method for handling sequential data, as they capture correlations between data points that are close in the sequence [165]. The information captured by an RNN depends on its structure and on the training algorithm it implements. The authors of [166] used an RNN under the Euclidean metric, which records the distance between two frame sequences. They used an RNN with one hidden layer consisting of 150 unidirectional, fully interconnected Long Short-Term Memory (LSTM) cells. They fed 1500 frame vectors to the input layer and trained the RNN with the Adam optimizer with a learning rate of 0.0001.
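How the hidden state carries information from earlier inputs can be seen in a minimal forward pass. This sketch uses a plain (Elman-style) recurrent cell with a tanh activation rather than the LSTM cells of [166], and all sizes and weights are arbitrary illustrative values.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(w_hh.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:
        # The new state depends on the current input and the previous state.
        h = np.tanh(w_xh @ x_t + w_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
w_xh = rng.normal(scale=0.5, size=(4, 3))  # input-to-hidden weights
w_hh = rng.normal(scale=0.5, size=(4, 4))  # hidden-to-hidden weights
b_h = np.zeros(4)
sequence = rng.normal(size=(5, 3))         # 5 time steps, 3 features each
states = rnn_forward(sequence, w_xh, w_hh, b_h)
```

Changing only the first element of the sequence changes the final hidden state as well, which is the sense in which the network "remembers" prior inputs; the same loop works for sequences of any length.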
Zhang et al.The results of their model were better than most of the other strategies.
FIGURE 12 shows the comparative analysis of the accuracies of various state-of-the-art approaches on different datasets.

VI. OPEN ISSUES AND RESEARCH CHALLENGES
FER has been an active research area in recent years. Various works have shown strong results and classified emotions accurately. Yet various challenges and issues remain in facial sentiment analysis. In this section, we discuss the issues and challenges faced by FER, which we identified by analyzing various survey papers.

A. OCCLUSION AND OCCLUDED DATA COLLECTION
Occlusion is a major obstacle to automatic facial expression recognition. Most current works use the JAFFE and CK+ datasets, either without occlusion or with artificially occluded faces. There is a lack of datasets that include natural facial occlusion. Databases containing occlusion should therefore be created, which is a time-consuming and difficult task. Datasets should be prepared by deciding what should occlude the face and where. Certain crucial parts of the image should not have an occluded region. The effective training and testing of the occluded dataset still remains a big challenge [4].
It is still a challenging task to collect spontaneous expressions under occlusion. Everyday human emotions such as happiness, surprise, and sadness can be easily evoked, but emotions such as curiosity and attentiveness are still difficult to evoke, particularly under occlusion. Therefore, several strategies should be considered that induce emotions that are precise and context dependent [26]. These strategies might bring challenges in implementation and selection and limit the types of occluded data collected.
After the collection of the occluded dataset, its effective training and testing remains a big issue. The occluded region, the level of occlusion, the type of occlusion, and the components and materials that are present in the occluded region pose a challenge for the FER system. One way to counter this challenge is to use the raw pixels of the occluded region, but enough information on the specific features of the image may not be recorded. Reliably determining special parameters such as the type, materials, components, and location of the occlusion is a critical component of FER.
There are many ways to detect occlusion in an image. Still, most current feature extraction techniques extract features from the face directly, without passing through a pre-processing layer for occlusion detection. Huang et al. [167] showed that the accuracy improves if a pre-processing layer for occlusion detection is included.
Datasets should include images of people of all age groups, as different age groups exhibit emotions differently. There are datasets that have images of a particular age group, but no dataset has a mixture of all the age groups [7]. Such a dataset, if developed, would assist research on cross-age, cross-culture, and cross-gender FER.

C. FER ON 3D DATA
Current research mainly focuses on 2D FER data, which is sensitive to illumination and pose variations [168]. 3D face shape models are naturally robust to pose variations and illumination factors. Reference [169] proposed a CNN without facial landmark detection that estimates expression coefficients from image intensities. Recently, many works have been proposed that combine 2D and 3D data to improve accuracy.

D. DIFFERENT MODALITIES IN FER
Facial expression is only one modality that can be used to recognize human behavior. Combining it with other modalities, such as infrared images, information captured from 3D models, and physiological data, is a trending research area because these sources provide largely complementary information. Reference [170] employed various multi-modal affect recognition techniques.

E. FER ON INFRARED DATA
At present, gray-scale and RGB images are the norm in deep FER, but they are vulnerable to lighting effects. Infrared images, however, record the emotions produced by skin distribution and are not sensitive to illumination variations. In 2017, Wu et al. [171] proposed a 3D CNN architecture to fuse spatial and temporal features in FER images.

G. OTHER ISSUES
Various other issues have arisen beyond the prototypical expression categories, namely the real versus fake emotion recognition challenge and the complementary emotion recognition problem. Building applications for real-time FER is also still a challenging task [174]. Many DL techniques have been applied to the above problems.

VII. CONCLUSION
This paper presents a detailed, systematic survey of current state-of-the-art approaches for facial emotion recognition in static images and of the parameters that influence their results. We have developed a taxonomy based on the different methods used for face detection, feature extraction, and emotion classification. Various facial expression databases used as input for FER are discussed. We have reviewed previous works in this field and found that a substantial body of work already exists. We have compared various detection, extraction, and classification approaches and identified which approaches achieve better performance with the available computation power. By discussing current issues and future research challenges, we conclude that much research is still needed in this field, such as FER on 3D face shape models and emotion recognition in images under occlusion. Real-time FER is also still a challenging task. In the future, we would like to survey the FER problem in videos using more advanced DL techniques.

FIGURE 2. Roadmap of the survey.

FIGURE 3. Evolution of facial recognition methods and datasets.

FIGURE 4. Possible strings used to search the literature.
where the input image vector is $U \in \mathbb{R}^{n}$ (training set) with mean $M$. Then $W = [w_1, w_2, \ldots, w_k, \ldots, w_n]$.
• Calculate the Euclidean distance ($D$) of $W_x$ and $W$.
• After calculating the Euclidean distance, the image is classified as face or not a face.
Islam et al. [84] used PCA to reduce the redundant features. They used downsampling to reduce the number of redundant features. Consider the size of an image as $(m \times n)$.
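
The distance-based decision described in the steps above can be illustrated with a minimal Python sketch. It assumes that an orthonormal PCA basis has already been computed and that a probe image is compared against the weight vectors of the training faces; the function names, the toy 4-pixel "images", the basis, and the threshold are all hypothetical values chosen for illustration and are not taken from the surveyed works.

```python
import math


def project(image, mean, basis):
    """Project a mean-centered image onto each basis vector to get weights w_k."""
    centered = [u - m for u, m in zip(image, mean)]
    return [sum(e_i * c_i for e_i, c_i in zip(e, centered)) for e in basis]


def euclidean(a, b):
    """Euclidean distance D between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_face(image, mean, basis, train_weights, threshold):
    """Classify as face when the nearest training weight vector is within threshold."""
    w_x = project(image, mean, basis)
    d = min(euclidean(w_x, w) for w in train_weights)
    return d <= threshold, d


# Toy data (assumed for illustration only): 4-pixel images and 2 basis vectors.
MEAN = [0.5, 0.5, 0.5, 0.5]
BASIS = [[0.5, 0.5, -0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]  # orthonormal
TRAIN = [[0.9, 0.8, 0.2, 0.1], [0.8, 0.9, 0.1, 0.2]]
TRAIN_WEIGHTS = [project(u, MEAN, BASIS) for u in TRAIN]
THRESHOLD = 0.3  # hypothetical decision threshold

if __name__ == "__main__":
    face_like = [0.85, 0.85, 0.15, 0.15]
    not_face = [0.1, 0.9, 0.9, 0.1]
    print(is_face(face_like, MEAN, BASIS, TRAIN_WEIGHTS, THRESHOLD))
    print(is_face(not_face, MEAN, BASIS, TRAIN_WEIGHTS, THRESHOLD))
```

In practice the basis would come from an eigen-decomposition of the training covariance matrix, and the threshold would be tuned on validation data.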

FIGURE 10. W. S. McCulloch and W. Pitts proposed a single neuron model as a mathematical model for an artificial neuron [160].
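
As a rough illustration of the single neuron model named in this caption, the following Python sketch implements a threshold unit over binary inputs. It uses the common weighted-sum simplification of the McCulloch-Pitts neuron; the function names, weights, and thresholds are assumptions made for this example and are not taken from [160].

```python
def mp_neuron(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold, else 0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0


def logical_and(a, b):
    """AND gate: both inputs must be active to reach the threshold of 2."""
    return mp_neuron([a, b], [1, 1], threshold=2)


def logical_or(a, b):
    """OR gate: a single active input is enough to reach the threshold of 1."""
    return mp_neuron([a, b], [1, 1], threshold=1)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Changing only the threshold turns the same unit from an AND gate into an OR gate, which is the sense in which the neuron acts as a simple decision element.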

[36] proposed a PHRNN (Part-based Hierarchical bidirectional Recurrent Neural Network) to classify the different facial emotions. Their model uses the PHRNN to extract temporal features and a multi-signal CNN (MSCNN) to extract facial emotion features from still frames. The model, called the Spatial-Temporal Network, has three stages of interest: PHRNN, MSCNN, and model fusion. They tested it on Oulu-CASIA, MMI, and CK+, where it performs well and diminishes the error rates from the early trial. They also compared different models, namely CNN, MSCNN, PHRNN-MSCNN (without sorting item), and PHRNN-MSCNN (with sorting item), with assessed accuracies of 93.4, 95.7, 96.7, and 98.5, respectively.
B. DATASETS IN FER
The other challenge in FER is the lack of a proper training dataset in terms of both quality and quantity. The dataset
VOLUME 8, 2020

FIGURE 12. Accuracy of proposed models on different datasets vs year.

F. VISUALIZATION TECHNIQUES
Applying visualization techniques [172] to a CNN model enables a quantitative analysis of how the model contributes to the visualization-based rules of FER and reveals which parts of the face carry the most discriminative information. The results indicate that the activations of filters correlate strongly with the face regions that correspond to particular Action Units. In 2016, Mousavi et al. [173] built on visualization techniques and proposed a new visualization technique, LIPNet.
FIGURE 1. The general pipeline of FER systems.

TABLE 2. A relative comparison of the proposed survey with the existing FER surveys.

TABLE 3. Research questions and their objectives.

TABLE 4. Different facial expression databases.

TABLE 5. A relative comparison of various state-of-the-art facial detection techniques.
1) VIOLA-JONES FACE DETECTION ALGORITHM
Viola-Jones is extensively used to detect the face in an image. The training time of this algorithm is quite long, but face identification is fast. It needs the full front view of the face as an input image. It has four stages: Haar-like features, integral images, AdaBoost training, and cascading classifiers.
FIGURE 5. The proposed taxonomy for facial sentiment analysis.
• Haar-Like Features: Viola-Jones algorithm uses

TABLE 6. Comparison of various state-of-the-art feature extraction techniques.

TABLE 7. Comparison of various state-of-the-art emotion classification techniques.