3DPalsyNet: A Facial Palsy Grading and Motion Recognition Framework Using Fully 3D Convolutional Neural Networks

The capability to perform facial analysis from video sequences has significant potential to positively impact many areas of life. One such area is the medical domain, specifically to aid in the diagnosis and rehabilitation of patients with facial palsy. With this application in mind, this paper presents an end-to-end framework, named 3DPalsyNet, for the tasks of mouth motion recognition and facial palsy grading. 3DPalsyNet utilizes a 3D CNN architecture with a ResNet backbone for the prediction of these dynamic tasks. Leveraging transfer learning from a 3D CNN pre-trained on the Kinetics data set for general action recognition, the model is modified to apply joint supervised learning using center and softmax loss concepts. 3DPalsyNet is evaluated on a test set consisting of individuals with varying ranges of facial palsy and mouth motions, and the results show an attractive level of classification accuracy in these tasks of 82% and 86%, respectively. The effect of frame duration and loss function on the predictive qualities of the proposed 3DPalsyNet was studied, and it was found that a shorter frame duration of 8 performed best for this specific task. Center loss combined with softmax loss has shown improvements in spatio-temporal feature learning over softmax loss alone, in agreement with earlier work in the spatial domain.


I. INTRODUCTION
The task of action recognition is a computer vision problem that has been subject to a significant amount of research for varying action types. Specific sub-tasks within this area have been studied, such as human motion, sports and facial expression recognition, where varying degrees of success have been shown [1], [2]. Within the medical domain there is significant interest in technology which can successfully detect human actions, primarily for medical pathologies which affect an individual's neuromuscular system, resulting in atypical movements. Through tracking the levels of atypical motion over time, clinicians can establish the current severity of both progressive and regressive conditions. One such condition is facial palsy, in which a sudden loss of facial muscle motion occurs due to damage to the cranial nerve. This nerve damage produces an extreme asymmetrical appearance which can be especially significant in the eye, brow and mouth regions of the face, both at rest and during the forming of facial expressions. Previous medical research [3]-[5] has highlighted the correlation between patient outcomes and the diagnosis and rehabilitation prescribed by trained medical professionals; specialised therapy plans tailored via regular feedback resulted in the best patient outcomes [6]. The use of automated facial analysis on a smart device has the potential to provide the clinician with more regular objective feedback on the condition and to tailor therapy without always needing to physically see the patient. This is especially beneficial in scenarios where the distance between clinician and patient is large or the availability of either party to meet is limited. As the face plays a major role in interpersonal communication and facial expression, the onset of facial palsy can have a significant psychological impact upon patients. The capability to track rehabilitation privately within a comfortable setting, such as their own home, may also provide a benefit to some patients.
To develop an automated system that can assist medical professionals in the tracking and planning of a facial palsy rehabilitation plan, there are a number of challenges. Two key functions of a potential system are the capability to recognise facial motions and to grade the facial palsy level. In a clinical setting the medical professional would guide the patient through a range of specific facial motions for facial palsy grading, so that the medical professional's supervision would ensure correct motions were carried out. The challenge of recognising specific facial motions, for example a smile, has been heavily researched, especially in facial expression recognition problems; however, in the case of facial palsy, recognising the asymmetrical nature of the facial motion adds a further challenge [1]. Fig. 1 provides an example of this specific challenge. In the case of automated grading of facial palsy, only a small amount of research has been conducted, mainly limited to traditional methods using Local Binary Pattern features and a Support Vector Machine classifier on a small sample size [7], [8]. Recently, [9] applied a Convolutional Neural Network (CNN) based method to a larger data set of 2000 images. While the method has shown promising results, the technique still uses images rather than video data. It is known that the temporal information available from video data can provide further discriminative information to ascertain a facial expression [10]. This temporal information also has the potential to boost facial palsy grading by providing the capability to examine the range of motion across an entire action rather than the single frame used by current methods.
Recognition tasks on video sequences remain challenging and have yet to show the dramatic increase in accuracy that has occurred in detection tasks on static images. While deep learning based approaches have been proposed, such as Recurrent Neural Networks (RNN) [11], Two-stream [12] and C3D [13], research to date has shown that each method has some limitations. RNN based networks have been shown to be incapable of capturing the powerful convolutional features for recognition tasks [1]. Two-stream methods use both image data and optical flow features to represent the spatial and temporal data, respectively, and have been shown to produce some of the most promising results, though they require pre-processed optical flow features that add computational overhead. While the C3D method uses 3D convolutional layers to learn spatio-temporal features and has demonstrated good accuracy on the sport action data set, it does not generalise well to other more complex recognition tasks [2]. This is mainly due to the relatively small video data sets available for optimising the large number of parameters in 3D CNNs. In addition, the C3D network is shallow in comparison to the state-of-the-art architectures used in image based recognition tasks, where deeper networks have generally performed better. The introduction of the Kinetics data set [14], which contains 300,000 videos, has provided a large scale data set with the potential to train deep 3D CNNs that generalise well to other action recognition tasks [2].
Recently, the research team developed a new multi-task framework for joint face detection and facial landmark localisation, namely the Integrated Deep Model (IDM), which has demonstrated robust performance on face and landmark detection. Based on this initial work, a further novel framework, 3DPalsyNet, is proposed for facial palsy diagnosis, where the IDM is cascaded with two further 3D ResNet components that are designed to detect mouth motion and carry out palsy level grading, respectively. Fig. 2 shows the schematic view of the new 3DPalsyNet framework. The framework leverages the Integrated Deep Model [15] to initially perform face detection on video sequence frames, and a fully 3D end-to-end CNN architecture with a ResNet backbone was specifically designed to address the challenge of facial palsy analysis. The fully 3D end-to-end CNN network is then trained via transfer learning for mouth motion estimation and facial palsy grading, respectively.
In summary, the novel contributions to knowledge outlined in this paper are:
1) Extending the IDM to video-based facial modelling and proposing 3DPalsyNet, a new framework for facial palsy analysis.
2) A 3D CNN architecture with a ResNet backbone, included in 3DPalsyNet to address the needs of mouth motion recognition and facial palsy grading, respectively.
3) A new center loss based transfer learning scheme, developed for the spatio-temporal domain, to train the 3D CNNs. We also carry out transfer learning via training on the Kinetics data set, and apply the learned model to the two tasks.
The ablation experiments were designed to investigate the effect of loss function and frame duration on classification accuracy.
The remainder of this paper comprises a review of relevant work in Section II, followed by an in-depth overview of the proposed methods in Section III. Section IV discusses the experiments undertaken and the results obtained. Section V presents a conclusion.

II. RELATED WORK
The task of action recognition is well established within the field of computer vision, with applications ranging from identifying sports based upon the movement of the participants [16] to human facial emotion recognition [17]. Unlike the methods applied in object detection, which deal with only the spatial domain, the learning of discriminative temporal features from motion data across n frames of a video sequence adds further challenges to the action recognition task. A selection of the methods proposed for this challenge is discussed within this section.

A. CLASSICAL METHODS
Prior to the rise of deep learning and convolutional neural networks, many techniques were proposed to extract spatio-temporal features from video frames for action recognition problems. Optical flow is a well established method that depicts the pattern of apparent motion of image objects between two consecutive frames, caused by the motion of objects. Optical flow features are represented by a 2D vector field, where each vector is a displacement vector showing the movement of a point from the first frame to the second. More recently, Liu et al. [18] proposed a new optical flow based feature, called Main Directional Mean Optical-Flow (MDMO), which is a variant of Histogram of Optical Flow (HOOF). This feature was validated on 36 separate regions of interest on the subject's face and has been shown to produce a very compact feature vector, with each region being described by only two values (the direction and magnitude of the optical flow vector). As discussed later in this section, optical flow is still useful within some state-of-the-art methods [12]. Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) were proposed for facial texture motion in Zhao et al. [19] and found popularity for action recognition problems due to their ability to describe motion textures efficiently. Further improvements to this method to reduce the feature size were proposed in [20].
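The compactness of a main-direction optical flow feature can be illustrated with a minimal sketch. It assumes the flow vectors of one region of interest are already available as (dx, dy) pairs, and that directions are quantised into 8 equal bins; the function name and the bin count are illustrative assumptions, not details taken from [18].

```python
import math


def main_directional_mean_flow(flow_vectors, num_bins=8):
    """Summarise one region's optical flow as (magnitude, direction).

    flow_vectors: iterable of (dx, dy) displacement pairs for the region.
    num_bins: number of equal direction bins (an assumed value).
    """
    bins = [[] for _ in range(num_bins)]
    bin_width = 2 * math.pi / num_bins
    for dx, dy in flow_vectors:
        # Map the direction of the vector to the range [0, 2*pi).
        angle = math.atan2(dy, dx) % (2 * math.pi)
        index = min(int(angle / bin_width), num_bins - 1)
        bins[index].append((dx, dy))

    # The main direction is the bin that holds the most vectors.
    main_bin = max(bins, key=len)
    if not main_bin:
        return 0.0, 0.0

    # Average only the vectors that fall in the main direction bin.
    mean_dx = sum(v[0] for v in main_bin) / len(main_bin)
    mean_dy = sum(v[1] for v in main_bin) / len(main_bin)
    return math.hypot(mean_dx, mean_dy), math.atan2(mean_dy, mean_dx)
```

Applying such a function to each of the 36 regions would give two values per region, which is consistent with the compact feature vector described above.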

B. 2D CONVOLUTIONAL NEURAL NETWORKS
The two-stream 2D CNN-based approach for action recognition has proven to be a popular technique within this field. Originally proposed by Simonyan et al. [12], the two streams refer to one stream which takes RGB image data for computing appearance features and a second stream which extracts stacked optical flow features to provide discriminative motion information. The combination of both appearance and motion information improved on the benchmark action recognition performance at the time of publication on the UCF-101 [21] and HMDB-51 [22] data sets. The two-stream method has been further studied to improve action recognition performance [23]-[25]. However, the generation of stacked optical flow features usually increases the computational complexity of this architecture.

C. 3D CONVOLUTIONAL NEURAL NETWORKS
Recently, 3D CNN-based approaches have begun to show promise in the task of action recognition as they have been able to leverage the introduction of large-scale training data sets. In contrast to the two-stream methods described previously, these architectures require only a single input to the network in the form of a video stacked as a set of individual frames. The extension to 3D convolutional kernels intuitively allows for the shift from feature learning in the spatial domain to the spatio-temporal domain, where the 3rd dimension captures the motion across the temporal plane. One of the first fully 3D CNN based models was proposed in Tran et al., which they termed C3D [13]. The model used fully 3D convolutional kernels and the Sports-1M data set [16] for training of the model's parameters. Through model evaluations they found that 3 × 3 × 3 convolutional filters produced the best performance. Further improvements in the recognition accuracy of the C3D model, obtained by expanding the temporal length, were reported in [26]. In the same study it was reported that applying optical flows as inputs to the 3D CNN resulted in a higher level of performance than can be obtained from RGB inputs, with the best performance being achieved when using a combination of RGB and optical flows. 3D CNN architectures using the Kinetics data set for training from scratch displayed results that were comparable with the results of ImageNet trained 2D CNN architectures in [14]. Recently, complex 3D CNN architectures have been explored, where initial studies were limited to shallow ResNet architectures [27]. However, more recently this has been expanded to much deeper ResNets with up to 152 layers and other architectures including ResNeXt-101 [2], which has been shown to achieve the best performance on the Kinetics test set. The study has also found that a Kinetics data set pre-[...]

III. METHOD
This section discusses the 3DPalsyNet framework proposed for the tasks of mouth motion recognition and facial palsy grading. 3DPalsyNet comprises two distinct stages. The initial stage relates to video pre-processing, employing face detection and landmark localisation to locate the faces in each frame of the sequence. The Integrated Deep Model is used in this stage. The detected face images are then cropped to the face and the number of frames per sequence is normalised to a fixed length. The second stage comprises two 3D CNNs, one for each of the face analysis tasks. The proposed 3DPalsyNet framework is shown in Fig. 2.

A. FACE DETECTION AND VIDEO SEQUENCE PRE-PROCESSING
The Integrated Deep Model (IDM) [15] allows for accurate face detection which has been shown to provide both high recall and precision. Accurate face detection is essential to ensure that faces from each frame of the video sequences are extracted for the second stage of the framework. The IDM method leverages a cascaded approach integrating a Faster R-CNN network trained for face detection and a Facial Alignment Network (FAN) to strengthen face detection precision. This integration is achieved through a heat map transformation and an integrated loss function. Given the heat map output of the FAN as H = {h_1, h_2, ..., h_n}, where each h_i is an n × m matrix equal in dimensions to the input image for the i-th facial landmark, each value in h_i corresponds to the probability of the facial landmark being located at that specific pixel location within a given face image. We propose a novel method, given by equation (1), to transform the heat map H to a probability score that can be applied to the task of face detection by integrating it with the loss function of the Faster R-CNN face detector.
Given the maximum probability max(H_i) for the i-th facial landmark, a specific scaling factor γ_i is applied for the corresponding landmark. The sum of the scaled probabilities is then normalised and can be considered as the probability of a face detection derived from the FAN network, defined as p_fan. The scaling value γ is primarily introduced to deal with wide ranging face poses, in which certain landmarks retain visibility across all poses while others become occluded. Two values are applied to γ: facial landmarks that are visible across all facial poses are given a value of γ = 1, while other landmarks are given γ = 0.75. The values for γ were selected as they have been shown to perform optimally in Storey et al. [15].
The next step is to define the joint probability of a face region, termed p_face and defined in equation (2), where p_faster is the probability based upon the output of the trained Faster Face features. The penalisation factor δ is specifically introduced for situations where extremely small detections are classed in the very high 90% probability range as being faces when they are not. The value of δ is determined by equation (3), where det is the width of the face detection box and img is the width of the image. It is worth noting that a probability penalisation is only applied when a face width is less than 2% of the total image width. Finally, p_face is used within the loss function for the face detection classification as described in equation (4).

B. 3D CNN ARCHITECTURE
3D convolutional architectures are a natural extension of the 2D counterparts that have been applied successfully to many image classification tasks. While 2D convolution filters have proven efficacy at learning discriminative features in the spatial domain, they lack the capability to extract spatio-temporal features in action classification tasks, where the input is typically video sequences. The 3rd dimension of the 3D CNN provides the mechanism to learn these spatio-temporal features. The proposed 3D CNN method adopts the ResNet [28] architecture as the backbone of the network; this architecture has been highly successful for image classification tasks. The capacity to develop deep ResNet architectures is related to the use of shortcut connections, which allow the data signal to bypass one layer and move to the next layer in the sequence; this permits gradients to flow from later layers to early layers. A basic ResNet block consists of two convolutional layers (Fig. 3 highlights the block design), and each convolutional layer is followed by a batch normalisation and a ReLU. A shortcut pass connects the top of the block to the layer just before the last ReLU in the block.
Unlike previous 3D CNN works in [2], we adopt the joint supervised learning of both softmax loss and center loss. It has been shown in other facial analysis tasks that using softmax loss only results in large intra-class variations of the learned features [29]. Therefore, the adoption of center loss will improve the inter-class dispersion and intra-class compactness.
Softmax loss is calculated as given in equation (5), where x_i ∈ R^d is the i-th feature, belonging to the y_i-th class, and d is the feature dimension. W_j ∈ R^d denotes the j-th column of the weights W ∈ R^(d×n) in the last fully connected layer and b ∈ R^n is the bias term. The mini-batch size and the total number of classes are m and n, respectively.
Center loss is defined in equation (6), where c_{y_i} ∈ R^d is the y_i-th class centre of the learnt features. The feature centres are updated after each mini-batch of training data. The total loss of the network is calculated by equation (7), where λ is used for balancing the two loss functions. Center loss is a significantly larger value and therefore requires scaling down. Based upon the experimentation in [29], a value of λ = 0.001 is used within the proposed 3D CNN.
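The joint supervision of softmax loss and center loss can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard formulation of center loss from [29] (half the summed squared distance between each feature and its class centre, added to the softmax loss with weight λ); the function names and the list-based data layout are hypothetical, and the per-mini-batch update of the centres is not shown.

```python
import math


def softmax_loss(features, labels, weights, bias):
    """Sum over the mini-batch of -log(softmax probability of the true class).

    weights[j] is the weight vector W_j of class j and bias[j] is b_j.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
                  for w, b in zip(weights, bias)]
        # Subtract the largest logit before exponentiating for stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[y]
    return total


def center_loss(features, labels, centers):
    """Half the summed squared distance of each feature to its class centre."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((x_k - c_k) ** 2 for x_k, c_k in zip(x, centers[y]))
    return 0.5 * total


def joint_loss(features, labels, weights, bias, centers, lam=0.001):
    """Softmax loss plus center loss scaled by the balancing factor lam."""
    return (softmax_loss(features, labels, weights, bias)
            + lam * center_loss(features, labels, centers))
```

With λ = 0.001 the center term contributes only a small fraction of the total, which reflects the remark that center loss is the larger quantity and needs to be scaled down.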

C. 3D CNN MODEL TRAINING
Both of the proposed 3D CNN architectures are trained with the following protocol for their specific tasks. Initially a ResNet18 model is pre-trained on the Kinetics data set for the action recognition task [2]. Transfer learning is then used to train the models for their respective facial analysis tasks: the weight parameters of the initial layers are frozen, and only the parameters of the last convolutional layers and the fully connected layers are trained. These layers are trained using a hybrid data set that combines samples from the CK+ emotion data set and a facial palsy data set, with relevant labels for the associated task (Section IV details the breakdown of the data set and the associated class labelling). To address the class imbalance within the data set, weighted sampling is employed so that each mini-batch has a similar distribution of class labels. Prior to the training process, the video sequences in the training set are first passed through the face detection stage of 3DPalsyNet and the faces are extracted. The extracted face sequences are then re-sized spatially to 112 × 112 pixels and temporally to n total frames. In this work we consider different values for n. When a sequence is shorter than n frames, duplicate frames are interpolated into the sequence, while sequences longer than n have frames removed at equally spaced intervals. Data augmentation techniques are applied to increase the total number of samples and to help avoid overfitting: random flipping, rotation and colour jitter are employed with 50% probability. Two stochastic gradient descent optimisers are then applied, one to fine-tune the network parameters and one to tune the center loss parameters. The training parameters include a learning rate of 0.1, a weight decay of 0.001 and a momentum of 0.9. Each model was trained for 50 epochs, which was sufficient to minimise model loss.
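The temporal normalisation and the class-balanced sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are selected by equally spaced nearest-index resampling (which repeats frames for short sequences and drops frames for long ones), and sampling weights are the inverse of the class frequency. Both function names are hypothetical, and the exact scheme used in the framework may differ.

```python
from collections import Counter


def normalise_frame_count(frames, n):
    """Resample a sequence to exactly n frames using equally spaced indices."""
    if not frames or n <= 0:
        raise ValueError("need a non-empty sequence and a positive n")
    count = len(frames)
    if n == 1:
        return [frames[0]]
    # Spread n indices evenly between the first and the last frame.
    indices = [round(i * (count - 1) / (n - 1)) for i in range(n)]
    return [frames[i] for i in indices]


def inverse_frequency_weights(labels):
    """Give each sample a weight inversely proportional to its class size."""
    class_counts = Counter(labels)
    return [1.0 / class_counts[label] for label in labels]
```

The weights could then be passed to a weighted sampler, for example `random.choices(samples, weights=weights, k=batch_size)` from the standard library, so that each class is drawn with roughly equal probability.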

IV. EXPERIMENTAL EVALUATION
This section presents a thorough experimental evaluation of the proposed 3DPalsyNet framework for both facial palsy grading and mouth motion recognition. All experiments are conducted using PyTorch 0.4 on Windows 10 with an Nvidia GTX 1080 GPU.
For the evaluation of the proposed method, the Extended Cohn-Kanade (CK+) [30] data set and a facial palsy data set were used. The CK+ database consists of 593 sequences generated from 113 subjects, while the facial palsy data set consists of 696 different sequences from 17 subjects collected from online sources. Since the CK+ sequences start from a neutral face and end at the full expression, they are aligned with the facial palsy data set by adding reversed frames, so that the last frame is also a neutral expression. While all samples from the CK+ data set are posed, the facial palsy set contains both posed motions and general motions such as talking. In the case of mouth motion recognition each sequence is labelled as one of the following: no motion, smile, mouth open and other mouth motions. For the grading of facial palsy the labelling follows the House-Brackmann scale shown in Table 1, which is commonly applied by medical professionals.
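The alignment of the CK+ sequences by adding reversed frames can be sketched as below. The sketch assumes that the peak expression frame is not repeated when the reversed frames are appended; the function name is hypothetical.

```python
def mirror_sequence(frames):
    """Extend a neutral-to-peak sequence so that it returns to neutral.

    The frames before the peak are appended in reverse order, so the
    result both starts and ends on the first (neutral) frame.
    """
    frames = list(frames)
    # frames[-2::-1] walks backwards from the frame before the peak.
    return frames + frames[-2::-1]
```

For a four-frame sequence this yields seven frames, after which the frame count would still need to be normalised to the chosen duration.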
To test the model's accuracy for the two classification tasks, a leave-one-subject-out (LOSO) protocol is adopted. This allows for testing on unseen faces, thus reducing any potential overfitting to previously seen faces. In practice we do not build models to test all subjects in the data set. Ten subjects have been used for the evaluation process; they are split equally into 5 having facial palsy (Subjects 1 to 5) and 5 who do not have facial palsy (Subjects 6 to 10). The 10 selected subjects cover the total range of labels for both tasks. In total there are 397 samples used for the evaluation.
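A leave-one-subject-out split restricted to a chosen set of test subjects can be sketched as follows. The data layout, a list of (subject identifier, sample) pairs, and the function name are assumptions made for illustration.

```python
def leave_one_subject_out(samples, test_subjects):
    """Yield (held_out, train, test) for each held-out test subject.

    samples: list of (subject_id, sample) pairs.
    test_subjects: the subjects to hold out in turn, one model per subject.
    """
    for held_out in test_subjects:
        # Every sample of the held-out subject goes to the test set only.
        train = [s for subject, s in samples if subject != held_out]
        test = [s for subject, s in samples if subject == held_out]
        yield held_out, train, test
```

Each split trains on all remaining subjects, so no face in a test set has been seen during the training of the corresponding model.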

A. MOUTH MOTION RECOGNITION
Figure 4 provides the overall results for mouth motion recognition, where it was found that the proposed model has a good predictive capability in this task, producing an F1 score of 82%. Figure 6 gives the results for each of the LOSO test sets; all subjects perform reasonably well, with F1 scores close to 80%, with the exception of Subject 5. On inspection (Figure 5 shows the confusion matrix for this subject), the samples which prove difficult to classify correctly are an example of an issue which reduces the accuracy for all subjects: the overlap in motions between those labelled as other and the remaining classes. There are motions which are similar to a smile, and the frame normalisation can result in the loss of frames which would differentiate these motions. As this method also uses global features for learning, it is possible that other motions, such as those from the movement of the eyes and brows, overlap across classes, thereby reducing accuracy.
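For reference, an F1 score can be derived from a confusion matrix as sketched below. The sketch assumes rows hold the true class and columns the predicted class, and it averages the per-class scores with equal weight (macro averaging); the averaging used for the reported scores is not stated, so this choice is an assumption.

```python
def per_class_f1(confusion):
    """Return the F1 score of each class.

    confusion[t][p] is the number of samples of true class t that were
    predicted as class p.
    """
    size = len(confusion)
    scores = []
    for c in range(size):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(size)) - true_pos
        denominator = 2 * true_pos + false_pos + false_neg
        scores.append(2 * true_pos / denominator if denominator else 0.0)
    return scores


def macro_f1(confusion):
    """Average the per-class F1 scores with equal weight per class."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)
```

Inspecting the per-class scores alongside the confusion matrix helps to show which classes, such as the other mouth motions, account for most of the errors.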

B. FACIAL PALSY GRADING
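The overall results for the facial palsy grading evaluation are shown in Figure 7. It was found that the proposed model provides a high level of accuracy with an F1 score of 88%. In Figure 6 the results for each of the LOSO test sets are depicted, showing that all subjects from the CK+ data set have a palsy grading label of 1 and are all correctly classified. Subject 1 shows very poor accuracy in comparison to all other subjects. Subject 1 has 29 samples; of the 20 incorrect gradings, 16 are within ±1 grade of the correct label. Subject 1 is a particularly difficult set of sequences, as most of the facial expressions are not posed but are of the individual during normal conversation.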

C. ABLATION STUDY - FRAME DURATION
Frame duration is a potentially significant parameter when processing video sequences. Reducing the sequences to a short frame duration can remove important features, while long frame durations may add redundant information, resulting in more computational overhead. In the action recognition work of [2] a frame duration of 16 was found to work well; as face motions are typically shorter in duration, this study evaluates shorter frame durations. In this experiment the effect of the frame duration on performance is evaluated. Table 2 reports the F1 scores achieved over the test sets for frame durations of 8, 12 and 16. From the results it can be seen that a frame duration of 8 gives the best performance. It is to be noted that there are samples which are correctly classified with the larger frame durations but incorrectly graded with the 8-frame duration. This is due to the lack of uniformity across motion duration in these tasks. Not only does this parameter have an effect on the accuracy of the model, it also has a large effect on the computational overhead of the framework. This can be seen in Table 2, where the use of an additional 4 frames adds about 1 hour to the time the model took to train for 50 epochs of the data set.

D. ABLATION STUDY - LOSS FUNCTION
A joint supervised method for model training, applying both center loss and softmax loss, has demonstrated the capacity to learn a more discriminative feature representation in the spatial domain than applying softmax loss alone. In this paper the experiment has been revisited for the spatio-temporal domain, specifically modified for the proposed 3DPalsyNet framework. The study used Subjects 1 to 5 and the results obtained are shown in Fig. 8. For the 366 samples in the facial palsy test, F1 scores of 86% and 82% were obtained for center loss with softmax loss and for softmax loss alone, respectively. This is a small improvement in performance, as might be expected from image recognition problems. On the other hand, the results obtained for mouth motion recognition have proven more difficult to improve, as demonstrated by a significant decrease in F1 score from 82% to 49% when also applying center loss.

FIGURE 1. When is a smile a smile? The left image shows the smile of a facial palsy patient, the centre a smile, and the right an asymmetrical motion which is similar to a smile.
VOLUME 4, 2016 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

\[ loss_{cls} = \sum_{i=n}^{N} \left[ -p^{*} \cdot \log(p_{face,i}) - (1 - p^{*}) \cdot \log(1 - p_{face,i}) \right] \tag{4} \]

\[ loss_{softmax} = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}} \tag{5} \]
TABLE 1. House-Brackmann scale.
Grade  Description
1      Normal
2      Mild dysfunction (slight weakness, normal symmetry at rest)
3      Moderate dysfunction (obvious but not disfiguring weakness, normal symmetry at rest). Complete eye closure w/ maximal effort, good forehead movement.
4      Moderately severe dysfunction (obvious and disfiguring asymmetry). Incomplete eye closure, moderate forehead movement.
5      Severe dysfunction (barely perceptible motion)
6      Total paralysis (no movement)

FIGURE 6. F1 Score by Subject Test Set.

TABLE 2. Frame Duration Results