I. Introduction
In the last decade, Facial Expression Recognition (FER) research has gained considerable attention owing to advances in related areas such as face detection and recognition [1], [2]. These advances have encouraged researchers to study facial expressions and to build real-time FER approaches. FER approaches can be divided into two main categories. The first category comprises traditional approaches that extract handcrafted features using methods such as Gabor wavelets, Local Binary Patterns (LBP), and Principal Component Analysis (PCA) [3]. The features are then assigned to their respective facial expression classes by classifiers such as Support Vector Machines (SVM) and Nearest Neighbor (NN) [4].

The second category comprises deep learning-based approaches, which reduce the dependence on manual extraction of facial patterns and enable machines to learn directly from the input images [5]. These approaches consist of three basic steps: pre-processing, feature learning, and feature extraction. The pre-processing step is applied before training to enhance the input images through operations such as face alignment, image cropping, and face normalization [6]. Feature learning and extraction are performed by Deep Neural Networks (DNNs), which use training data and learning algorithms to capture the relations among the extracted features [4]. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the most common deep learning architectures in this field, learning spatial and temporal data patterns, respectively [5]. In a CNN, the input image is convolved with a set of filters; each filter holds a particular set of values and produces a corresponding feature map.
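As a concrete illustration of the handcrafted features used by the traditional approaches, the following minimal sketch computes a basic 3x3 LBP code for one pixel; the `lbp_code` helper, the neighbor ordering, and the toy patch values are illustrative choices, not drawn from any cited method.

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 Local Binary Pattern: threshold the 8 neighbors
    against the center pixel and read the results as an 8-bit code."""
    center = patch[1, 1]
    # Clockwise neighbor order starting at the top-left (one common convention)
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    return sum(b << i for i, b in enumerate(bits))

# Toy 3x3 grayscale patch; the center pixel has intensity 6
patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))  # an integer in [0, 255]
```

In a full pipeline, a histogram of these codes over the face image (or over local regions) would form the feature vector that is then passed to a classifier such as an SVM.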
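The convolution step of a CNN can likewise be sketched in a few lines; the `conv2d` helper, the toy image, and the edge filter below are illustrative assumptions, showing only how one filter's set of values yields one feature map.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNN layers):
    slide the kernel over the image and take elementwise dot products."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 "image" with a vertical edge between columns 2 and 3
image = np.array([[0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)

# A Sobel-like vertical-edge filter: one particular set of values
edge_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, edge_filter)
print(feature_map)  # responds strongly where the vertical edge lies
```

A convolutional layer applies many such filters in parallel, each with its own learned values, producing one feature map per filter.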