Facial Landmark Detection With Learnable Connectivity Graph Convolutional Network

Conventional heatmap regression with deep networks has become one of the mainstream approaches for landmark detection. Despite their success, these methods do not exploit the overall landmark structure. We present a new landmark detection method that captures the overall structure of landmarks by modeling them as a graph. Our method combines a deep heatmap regression network with a Graph Convolutional Network (GCN) into an end-to-end differentiable model. The proposed method can utilize both visual information and the overall landmark structure to localize landmarks in an image. The ad hoc spatial relationships between landmarks are learned naturally by the GCN. Experiments on multiple datasets show the robustness of the proposed method.

[...] The difference among them is how to use the information on facial appearance. Coordinate regression methods directly learn the mapping between discriminative features and the coordinate vectors of landmarks, and have drawn a lot of attention. Many previous methods [6], [14] reached satisfactory performance, but the results of coordinate regression methods are sensitive to face occlusion. The heatmap regression approach creates a probability heatmap for all target landmarks and has achieved state-of-the-art performance in studies of landmark detection for multiple views [5]. In addition, landmark detection methods with graphs can represent the predefined landmarks as a graph. Landmark detection with graphs makes the landmarks learnable, and it is robust against appearance variations [12], [13].

The associate editor coordinating the review of this manuscript and approving it for publication was Abdullah Iliyasu.

Graph-structured data are ubiquitous in computer vision, for example point clouds, human body joints (pose estimation), hand joints (hand gesture classification), and scene graphs. Integrating relational inductive biases from graph-structured data into deep learning architectures is essential for these systems to learn, reason, predict, and generalize well on such data. Recent years have seen a surge in research on deep learning for graph-structured data. Advances in graph representation learning create a new way to tackle many challenging computer vision problems by leveraging the inter-relationships between entities in a scene.

A facial landmark can be represented as a node, and these nodes form a graph that represents the overall facial structure. Unlike some graph-structured data in which the topology of the graph comes naturally (e.g., atoms in protein molecules), facial landmarks do not have an intrinsic graph connectivity scheme. Therefore, the graph topology for facial landmarks is either built using heuristics [15] or learned from data [12]. In this work, the latter approach is selected and further improved with a per-image graph connectivity scheme, in which the graph topology changes based on the input image, to increase the system's robustness in challenging scenarios where the initial prediction of the landmarks may not be reliable.

[...] is not an end-to-end differentiable model. By using the soft [...]

Ngoc et al. [22] applied a graph neural network to obtain [...]

[...] downsampling factor and C is the number of landmark types. We employ HRNet18 [23] as the CNN backbone to generate a heatmap. We also use features extracted from the backbone to construct node features for the landmark regression model.

High-Resolution Net (HRNet) [23] is a universal CNN-based architecture designed for many computer vision tasks, including object segmentation, human pose estimation, and object detection. The design of the HRNet architecture is based on two main concepts: maintaining high-resolution representations and multi-scale feature fusion. The architecture starts with a high-resolution convolution stream, then gradually adds high-to-low resolution streams as the network progresses to later stages. Every time the network transitions to the next stage, the multi-scale features are fused for each stream.

Let $h_i^l$ be the hidden state of vertex $v_i$ at iteration $l$ and $e_{ij}$ be the learned connectivity between nodes $v_i$ and $v_j$. Information is propagated through the graph G as follows: [...]

Following the work of Li et al. [12], we also enrich node features with visual features and shape features. We think that visual information can provide useful cues, such as boundary constraints, while shape features explicitly provide information on the overall landmark structure. This information is beneficial to the GCN landmark regression model for refining the initial landmark prediction.
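The propagation of hidden states over the graph using the learned connectivity can be illustrated with a minimal sketch. The exact update rule is not reproduced in this text, so the form used here (a transformed self term plus connectivity-weighted neighbor messages, followed by a ReLU) is an assumption based on a common graph convolution formulation, not necessarily the authors' equation. The names `propagate`, `W_self`, and `W_nbr` are hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def propagate(h, e, W_self, W_nbr):
    """One message-passing iteration (assumed form, see lead-in).

    h      : list of N hidden vectors, h[i] is the state of vertex v_i
    e      : N x N connectivity, e[i][j] weights the message from v_j to v_i
    W_self : weight matrix applied to the node's own state (hypothetical)
    W_nbr  : weight matrix applied to neighbor states (hypothetical)
    """
    n = len(h)
    updated = []
    for i in range(n):
        own = matvec(W_self, h[i])
        agg = [0.0] * len(own)
        for j in range(n):
            if j == i:
                continue  # a node does not send a message to itself
            msg = matvec(W_nbr, h[j])
            agg = [a + e[i][j] * m for a, m in zip(agg, msg)]
        updated.append(relu([o + a for o, a in zip(own, agg)]))
    return updated


# Tiny example: three nodes with 2-D hidden states and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
e0 = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
h1 = propagate(h0, e0, identity, identity)
```

Stacking several such iterations lets information from distant landmarks reach each node, which is how the graph model can correct a landmark whose local visual evidence is weak.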
Shape features: similar to [12], we also use the displacement between two nodes as shape features, $q_{ij} = [x_i - x_j,\; y_i - y_j]$. The shape features $q_{ij}$ are concatenated to the node features of neighbor node $v_j$ before information is aggregated to target node $v_i$. The shape features are added to the hidden states of neighbor nodes at every iteration to ensure that overall shape information persists as the graph model progressively updates.

[...] in an end-to-end manner. Therefore, the graph connectivity remains the same for a given task and is independent of the input images.

We argue that this may not be the optimal way to handle some [...]

For a pair of landmarks $(v_i, v_j)$ with corresponding landmark types $(l_i, l_j)$ and confidence scores $(c_i, c_j)$, we first compute a class embedding: [...] where $g$ is the one-hot encoding operation. Then the graph connectivity is computed from the class embedding and the nodes' confidence scores: [...] A softmax function is applied to the graph connectivity to normalize the signal from neighbor nodes.

[...] we use an L1 loss on all predicted landmark coordinates to learn precise localization: [...]

[...] Focal loss [26] to stabilize the training process: [...]
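The shape features and the per-image connectivity described above can be sketched as follows. Only the displacement $q_{ij}$, the one-hot encoding $g$, the use of confidence scores, and the softmax normalization come from the text. The way the class embedding and the raw connectivity score are computed here (a linear map `W_emb` over the concatenated one-hot vectors, then a linear score `w_score` over the embedding and the two confidences) is an assumption made for illustration, and all function and parameter names are hypothetical.

```python
import math


def one_hot(index, num_classes):
    """The encoding operation g: landmark type -> one-hot vector."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]


def shape_feature(p_i, p_j):
    """Displacement q_ij = [x_i - x_j, y_i - y_j] between two landmarks."""
    return [p_i[0] - p_j[0], p_i[1] - p_j[1]]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def connectivity(types, conf, W_emb, w_score, num_classes):
    """Per-image connectivity e[i][j], normalized over the neighbors j of i.

    types   : landmark type index l_i for each node
    conf    : confidence score c_i for each node (varies per image)
    W_emb   : hypothetical embedding matrix over [g(l_i); g(l_j)]
    w_score : hypothetical scoring vector over [embedding; c_i; c_j]
    """
    n = len(types)
    e = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = [j for j in range(n) if j != i]
        scores = []
        for j in neighbors:
            pair = one_hot(types[i], num_classes) + one_hot(types[j], num_classes)
            emb = [sum(w * x for w, x in zip(row, pair)) for row in W_emb]
            feat = emb + [conf[i], conf[j]]
            scores.append(sum(w * x for w, x in zip(w_score, feat)))
        for j, weight in zip(neighbors, softmax(scores)):
            e[i][j] = weight
    return e


# Three landmark types; the weights below are arbitrary illustrative values.
W_emb = [[0.2, -0.1, 0.4, 0.3, 0.0, -0.2],
         [0.1, 0.5, -0.3, 0.0, 0.2, 0.1]]
w_score = [1.0, -1.0, 0.5, 2.0]
e_confident = connectivity([0, 1, 2], [0.9, 0.9, 0.9], W_emb, w_score, 3)
e_uncertain = connectivity([0, 1, 2], [0.9, 0.2, 0.9], W_emb, w_score, 3)
```

Because the confidence scores enter the score, the same landmark types can yield different connectivity for different images: in this sketch, lowering the confidence of node 1 reduces the weight that node 0 places on it.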

WFLW is a challenging dataset with multiple difficult detection scenarios. Testing results are reported in Table 1. Following previous research, we evaluate our method with three metrics: normalized mean error (NME, inter-ocular), AUC@0.1, and FR@0.1. Our method is among the top performers, achieving 4.24% NME (second best), 2.68% FR@0.1 (best), and 0.5892 AUC@0.1. The training loss and NME on the validation set are shown in Figure 4.

300W: We also compare our approach with several top-performing methods on the 300W dataset. Results on common, [...]

[...] Figure 5 shows the connection of a landmark to its neighbors. As can be seen from Figure 5, the graph structure varies from image to image. This behavior is intended, and we believe the flexibility of the graph structure boosts the performance of the GCN landmark model.
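The three evaluation metrics named above can be computed as in the following sketch, which uses their standard definitions: NME is the mean point-to-point error divided by a normalizing distance (here the inter-ocular distance, passed in as `norm`), FR@0.1 is the fraction of images whose NME exceeds 0.1, and AUC@0.1 is the area under the cumulative error distribution curve up to 0.1, divided by 0.1. The function names and the numerical integration with `steps` samples are choices made for this illustration and are not taken from the paper's evaluation code.

```python
import math


def nme(pred, gt, norm):
    """Normalized mean error for one image.

    pred, gt : lists of (x, y) landmark coordinates
    norm     : normalizing distance, e.g. the inter-ocular distance
    """
    errors = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(errors) / (len(errors) * norm)


def failure_rate(nmes, threshold=0.1):
    """Fraction of images whose NME is above the threshold (FR@threshold)."""
    return sum(1 for e in nmes if e > threshold) / len(nmes)


def auc(nmes, threshold=0.1, steps=1000):
    """Area under the cumulative error distribution up to threshold, scaled to [0, 1]."""
    n = len(nmes)

    def ced(x):
        # Fraction of images with NME no larger than x.
        return sum(1 for e in nmes if e <= x) / n

    xs = [threshold * k / steps for k in range(steps + 1)]
    area = sum((ced(xs[k]) + ced(xs[k + 1])) / 2 * (xs[k + 1] - xs[k])
               for k in range(steps))
    return area / threshold


# Illustrative per-image NME values (made up for this example, not results).
example_nmes = [0.05, 0.2, 0.08, 0.11]
example_fr = failure_rate(example_nmes)
example_auc = auc(example_nmes)
```

A lower NME and FR and a higher AUC indicate better performance. FR and AUC are computed over the per-image NME values of the whole test set.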

In this section, we examine the performance of our proposed method for learning the graph connectivity by comparing it to the learnable task-specific graph connectivity proposed by Li et al. [12]. We experiment with both the WFLW and 300W datasets. For a fair comparison, we use the same HRNet18 backbone pretrained on the WFLW and 300W datasets and freeze its weights. The backbone in this experiment achieves the [...] is near identical to the task-specific learnable graph connectivity method. As the WFLW dataset is considered more challenging than the 300W dataset, we conclude that our [...]

[...] The comparison of these two approaches is analyzed in the ablation study. Another advantage of using a heatmap for the initial landmark prediction is that we only need a single-stage GCN for landmark regression, whereas the method of [12] requires a two-stage cascaded GCN regression model for coarse-to-fine prediction, because the mean 2D location is not good enough as a coarse prediction. Our method can reuse a pre-trained landmark detection model directly, which eases training: we can freeze the heatmap model during training and still obtain good results. Other methods, such as WING [35] and AWING [31], concern the loss function, so they are orthogonal to our approach and can be used in conjunction with our work to further improve landmark detection.

Our method is built on top of a heatmap model, and its performance depends on the quality of that heatmap model. Even though we only test our method with HRNet18, it can be plugged into any kind of heatmap model and benefit from the boost in accuracy analyzed in the ablation study section. In addition, Figure 4 shows a clear gap in NME between the landmark prediction from the heatmap model and the one from the graph model, which means that our graph model consistently improves on the landmark prediction from the heatmap model. On the other hand, an obvious limitation of our approach is that the graph model only performs well when the initial guess from the heatmap model is good enough.

Due to its performance, heatmap prediction is currently the mainstream solution for facial landmark prediction. One flaw of the heatmap model is that it lacks a mechanism to exploit the overall structure of the human face to aid landmark prediction when visual information is insufficient. Using a Graph Neural Network (GNN) that exploits the overall human face structure to refine the landmark prediction from the heatmap is a good solution in challenging cases such as pose variation, blurry images, low illumination, and expression variation.

We propose a novel landmark detection model based on a graph convolutional network, which utilizes the overall landmark structure by modelling the landmarks as a graph. The graph structure varies depending on the input image in order to adapt to different situations. The experimental results show that our approach is competitive with some current state-of-the-art methods. The proposed method can be applied to any heatmap model to boost landmark prediction accuracy.