TCMINet: Face Parsing for Traditional Chinese Medicine Inspection via a Hybrid Neural Network With Context Aggregation



I. INTRODUCTION
Traditional Chinese Medicine (TCM) has become a global and essential diagnostic approach in the medical field [1]. In TCM, inspection is a critical diagnostic step in which the current state of a patient is assessed by observing the expression, appearance, color, and abnormal changes of the body, face, and inner facial components (e.g., eyes, lips, tongue). The face and inner facial components are believed to reveal signs of various health conditions or even diseases of the internal body [2]. For instance, people with hepatitis and other liver issues may have a face or eyes with a yellow tone [3]. The tongue of HIV-infected patients may be swollen and tooth-marked [4]. Moreover, the lip color of a person is considered a symptom that reflects the physical condition of organs in the body [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Generally, the first preprocessing step of most computer vision-aided facial medical analysis techniques consists of detecting or segmenting the face and facial components from face images. However, the existing literature focuses merely on detecting or segmenting single face organs [6]-[8]. As a special case of face parsing, face parsing for TCM inspection (TCMI) amounts to labeling each pixel as left eye, right eye, lips, tongue, face, or background, following the principles of the TCM holistic view [9], [10]. This task poses several challenges. First, the patient opens the mouth wide with the tongue sticking out, so the lower lip is partially (or totally) occluded by the tongue. Second, the tongue color gamut overlaps heavily with the lip and face color gamuts. Third, in addition to the face and target facial components, the acquired face images contain many non-target components, such as hair, beard, teeth, and the inner tissue of the mouth. Fourth, the tongue color, facial expression, skin gloss, and other conditions vary more among patients than among healthy people. The surface of a patient's tongue carries abundant pathological details, such as cracks, red points, and tooth marks. These details are often only a few pixels in size, which makes parsing more difficult.
Existing face parsing work [11]-[13] has shown significant advantages from focusing on individual regions of interest (ROIs) for inner facial components. However, these methods [11]-[15] mainly segment hair, eyebrows, and other facial components that are rarely relevant to TCMI, rather than the tongue, which is essential for TCMI applications. Face parsing for TCMI is a new and challenging task to which little work has been devoted. Accordingly, a new hybrid architecture that follows the TCM diagnostic principles is needed. Furthermore, parsing face images robustly demands effective contextual modeling [16]-[19]. Inspired by these methods, we propose a novel TCMINet that estimates masks for the face and for each inner facial component separately, as shown in Fig. 1. Specifically, we first construct the Inspection Feature Extraction (IFE) module to perform efficient dense feature extraction with fast computation. Then the hierarchical Facial Inner Components Segmentation (FICS) structure and the Face Segmentation (FS) structure process the inner facial components (left eye, right eye, tongue, lips) and the face, respectively. Moreover, we employ context aggregation modules (C1 ↑↓←→, C2) to smooth the label prediction map and to refine boundary localization for the inner facial components and the face. The arrows indicate the sweeping or propagation directions. For inner facial components, we employ an effective context aggregation module C1 ↑↓←→ [21], [22], which uses four recurrent neural networks (RNNs) to sweep across the image vertically and horizontally, in both directions each, to incorporate the global context.
For the face, we employ an efficient context aggregation module (C2) [16], [17], which models semantic contextual dependencies of local representations along four context propagation directions (southeast, southwest, northwest, and northeast).

In addition to a high-performance network, a good dataset with high-quality, well-labeled images is also crucial. There are only a few face parsing datasets, such as LFW-PL [20] and HELEN [14], and most of their images are not suitable for this task. To mitigate this problem, we construct a face parsing dataset named TCMID, which contains 1500 face images captured by professional imaging devices under certain conditions (in a dark chest, not in the open air). In TCMID, each image is provided with an accurate pixel-level label map with 6 categories (left eye, right eye, lips, tongue, face, and background).

The contributions of this paper are summarized as follows:
(1) We build a face parsing dataset (TCMID) and a benchmark for training and testing. To the best of our knowledge, it is the first face parsing dataset for TCMI. Furthermore, we manually relabel some images of the HELEN and LFW-PL datasets, named LFW-PL and HELEN, for testing. The datasets are available at: https://github.com/FDUXilly/TCMID-face-image-dataset.
(2) We propose an effective hybrid architecture to address the problem of pixel-wise face parsing for TCMI. We introduce the context aggregation modules (C1 ↑↓←→, C2), which significantly smooth the label prediction map and refine boundary localization for the inner facial components and the face, respectively.
(3) Our network surpasses previous state-of-the-art results on the LFW-PL, HELEN, and TCMID datasets. In addition, ablation studies and exploratory experiments on TCMID are carried out to evaluate the hybrid network structure and the important modules of our network. The network runs at 267 ms per face image (512 × 512) on a GPU, which makes it feasible to integrate into engineering solutions.

II. RELATED WORK

A. TCMI - FACIAL MEDICAL ANALYSIS
Facial medical analysis is a non-contact, non-invasive diagnostic method of TCM [23]. The first task in computer-aided facial medical analysis is to detect and segment the facial components from face images. In [2], five facial regions (forehead, left cheek, right cheek, nose, jaw) are segmented using skin detection, facial normalization, and the horizontal positions of the mouth, nostrils, and eyebrows. Hu et al. [6] adopt a Gaussian Mixture Model (GMM) for lip segmentation. Li et al. [7] propose an end-to-end iterative tongue image matting network. Rot et al. [8] present a deep multi-class eye segmentation model built upon the SegNet architecture. As mentioned before, these methods only take separate face organs into account, resulting in inaccurate and biased diagnostic results. In this paper, we propose a hybrid architecture that can simultaneously detect and segment multiple facial components based on the principles of the TCM holistic view.

B. SEMANTIC SEGMENTATION
Semantic segmentation is of growing interest to computer vision researchers. FCN [24] is a baseline for generic images that applies full convolution to the entire image to extract features. Mask R-CNN [25] further advances semantic segmentation by extending Faster R-CNN [26] and integrating RoIAlign. Mask Scoring R-CNN [27] extends Mask R-CNN with a MaskIoU head and achieves a new state-of-the-art result. However, directly applying these methods to face parsing may fail to model the complex and varying spatial layout across face components, leading to unsatisfactory results.

1) CONTEXT AGGREGATION
One major group of works models the contextual dependencies of local regions within the CRF framework [18], [19]. Another group of works introduces sub-networks that aggregate context inherently [28].

C. FACE PARSING
Face parsing, which aims to assign pixel-level semantic labels to face images, has attracted much attention due to its wide range of potential applications, such as facial beautification [29] and face image synthesis [30].

1) METHODS
Most existing approaches for face parsing can be categorized into two branches: global-based methods [31], [32] and local-based (hybrid) methods [11]-[13], [15]. Global-based methods predict semantic labels over the whole input image. Wei et al. [31] design automatically regulated receptive fields in a deep image parsing network. Zhou et al. [32] propose a network that jointly employs super-pixel information and the CRF model. Nevertheless, the accuracy of these methods is limited because they do not focus on each individual part (see Table 10). In contrast, local-based methods train separate models for the various facial components. Zhou et al. [11] design an interlinked CNN-based architecture that predicts pixel labels after facial localization. Liu et al. [12] propose a network that combines hierarchical representations learned by a CNN with label propagation achieved by a spatially variant RNN. Lin et al. [13] propose a novel network combined with RoI Tanh-warping for face parsing. All of these approaches focus on general face parsing tasks but ignore some facial components that are essential for TCMI applications.

2) DATASETS
Although many face-related fields have been well studied for many years, the existing datasets for face parsing are still severely limited, mainly because pixel-level annotation is time-consuming. The most commonly used public datasets for face parsing are LFW-PL [20] and HELEN [14]. The LFW-PL dataset contains 2,927 face images, in which every pixel is manually assigned to one of the hair/skin/background categories. The HELEN dataset contains 2,330 face images with manually labeled facial components including eyes, eyebrows, nose, inside mouth, lips, etc. However, in both datasets, the tongue is not annotated.

D. RECURRENT NEURAL NETWORK
RNNs have been shown to be effective for modeling short- and long-term dependencies in sequential data. For images, we can apply 1-D RNNs along multiple dimensions [21], [33] or use a multi-dimensional RNN (MDRNN) [34], [35] so that each neural node receives information from multiple directions. Visin et al. [21] propose ReNet, a stack of 1-D RNNs, to perform image classification. Based on the ReNet model, an architecture for semantic segmentation called ReSeg [22] has been proposed. Both works observe promising performance gains after incorporating RNNs.

III. THE PROPOSED TCMINET
We use a hybrid solution to estimate masks for the face and inner facial components simultaneously. Given a cropped face image $I$ that contains a single face in its center, the Inspection Feature Extraction (IFE) module captures dense feature maps $F$, which are shared by the Facial Inner Components Segmentation (FICS) and Face Segmentation (FS) branches. In the FICS branch, a bounding box (bbox) is predicted for each inner facial component, and the features of each component within its bbox are mapped to a fixed resolution through PrRoI Pooling [36]. Next, C1 ↑↓←→ is adopted to model global contexts and reduce computational cost. At the end of the FICS branch, the segmentation masks of the inner facial components are predicted. In the FS branch, C2 is designed to link pixel-level and local information of $F$. As in the FICS branch, the pixel-wise segmentation mask $M_{face}$ is predicted at the end. Finally, we gather all segmentation masks to form the face parsing result $M$.
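The final gathering step can be illustrated with a small sketch. It is not the authors' implementation: the numeric label ids, the `gather_masks` helper, and the rule that component pixels overwrite face pixels (with the tongue applied last) are assumptions, since the text only states that the masks are gathered into $M$. The masks are also assumed to be already mapped back to image coordinates.

```python
# Hypothetical label ids; the paper names six categories but not their numeric codes.
LABELS = {"background": 0, "face": 1, "left_eye": 2, "right_eye": 3, "lips": 4, "tongue": 5}


def gather_masks(face_mask, component_masks,
                 order=("left_eye", "right_eye", "lips", "tongue")):
    """Merge binary masks (lists of rows of 0/1) into one label map.

    Assumption: component pixels overwrite face pixels, and components later
    in `order` overwrite earlier ones (the tongue may occlude the lips).
    """
    parsed = [[LABELS["face"] if px else LABELS["background"] for px in row]
              for row in face_mask]
    for name in order:
        for y, row in enumerate(component_masks[name]):
            for x, px in enumerate(row):
                if px:
                    parsed[y][x] = LABELS[name]
    return parsed
```

With this ordering, a pixel claimed by both the lips mask and the tongue mask ends up labeled as tongue, which matches the occlusion described in the introduction; a different merging rule would be equally compatible with the text.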
As illustrated in Fig. 1, the whole network consists of four consecutive functional modules: IFE, Component Bounding-box Prediction, Context Aggregation, and Feature Map Up-sampling. Next, we introduce each module in detail.

A. IFE MODULE
The Xception model [37]-[39] has shown promising performance with fast computation. We work in the same direction and modify the Xception model for the task of face parsing. As illustrated in Fig. 2, max-pooling operations are replaced by depthwise separable convolutions, which allow efficient dense feature extraction at arbitrary resolution. We use PReLU [40] as the non-linearity rather than ReLU since it allows negative responses, which in turn improves the network performance (see Table 5). Furthermore, all 3 × 3 depthwise convolution layers and 3 × 3 dilated depthwise convolution layers are followed by batch normalization (BN) and a PReLU activation.
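Two of the design choices above can be made concrete with a short sketch. The functions below are illustrative only: the slope `a = 0.25` and the channel sizes in the usage note are assumed values, not settings reported in the paper.

```python
def relu(x):
    """ReLU discards all negative responses."""
    return x if x > 0 else 0.0


def prelu(x, a=0.25):
    """PReLU keeps a scaled negative response; `a` is learned in practice."""
    return x if x > 0 else a * x


def conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weight count of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out
```

For a hypothetical 3 × 3 layer mapping 128 to 256 channels, `conv_params(3, 128, 256)` gives 294912 weights while `separable_conv_params(3, 128, 256)` gives 33920, which is the kind of saving that makes dense feature extraction cheap.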

B. FACIAL INNER COMPONENTS SEGMENTATION (FICS) BRANCH

1) COMPONENT BOUNDING-BOX PREDICTION MODULE
The semantic label of every inner facial component is explicitly defined in our work (e.g., left/right eye). We therefore regress the area of each inner facial component directly instead of detecting components individually in a Mask R-CNN fashion [25], [27]. The prediction module consists of two convolutional layers followed by global average pooling and a fully connected layer. This design avoids ambiguities between components and reduces computation cost. The component prediction module locates the bounding boxes of the $N$ inner facial components. We adopt the $L_1$ loss for the bounding-box regression, so the regression loss $L_{reg}$ measures the $L_1$ distance between the regressed and ground-truth bounding boxes.

Low accuracy of the regressed bounding boxes usually leads to poor segmentation performance. In our experiments, we observe that parts of a target may fall outside its bounding box, especially for lips. To mitigate this problem, we pad the regressed bounding boxes by 20% of the feature map size for lips and 10% for other components. The padded bounding boxes yield good hints for predicting high-accuracy masks (see Table 6).
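The regression loss and the padding rule can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are taken to be `(x1, y1, x2, y2)` tuples in feature-map coordinates, the loss is averaged over all coordinates, and the padding is applied on every side and clipped to the feature map. None of these conventions is specified in the text.

```python
# Padding ratios from the text: 20% for lips, 10% for the other components.
PAD_RATIO = {"lips": 0.20, "left_eye": 0.10, "right_eye": 0.10, "tongue": 0.10}


def l1_bbox_loss(pred, target):
    """Mean absolute error over all box coordinates (normalization assumed)."""
    total = sum(abs(p - t) for pb, tb in zip(pred, target) for p, t in zip(pb, tb))
    return total / (4 * len(pred))


def pad_box(box, ratio, fmap_w, fmap_h):
    """Enlarge a box by `ratio` of the feature-map size per side, clipped to the map."""
    x1, y1, x2, y2 = box
    dx, dy = ratio * fmap_w, ratio * fmap_h
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(float(fmap_w), x2 + dx), min(float(fmap_h), y2 + dy))
```

On a hypothetical 64 × 64 feature map, a lips box `(20, 30, 40, 50)` grows by 12.8 cells on each side, so a slightly misplaced regression still covers the whole lip region before pooling.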

2) CONTEXT AGGREGATION MODULE (C1 ↑↓←→ )
Parsing face images robustly demands effective contextual modeling. For the inner facial components {left eye, right eye, lips, tongue}, we first use PrRoI Pooling [36] to map the features of each component to a fixed resolution individually. We feed the resulting feature maps $E_i$ into recurrent layers for refinement. As depicted in Fig. 1, each recurrent layer is composed of four RNNs. Specifically, we take a feature map $E_i$ of elements $e \in \mathbb{R}^{A \times B \times C}$, where $A$, $B$, and $C$ are respectively the height, width, and number of channels, and split it into $K \times L$ patches $p_{k,l} \in \mathbb{R}^{A_p \times B_p \times C}$. First, we sweep the image vertically with two RNNs ($o^{\downarrow}$ and $o^{\uparrow}$). Each RNN reads the next non-overlapping patch $p_{k,l}$ and, based on its previous state, emits a projection $q^{\downarrow}_{k,l}$ (or $q^{\uparrow}_{k,l}$) and updates its state $r^{\downarrow}_{k,l}$ (or $r^{\uparrow}_{k,l}$). We concatenate the projections $q^{\downarrow}_{k,l}$ and $q^{\uparrow}_{k,l}$ to obtain the feature map $Q$. Then we sweep over each of its rows with two RNNs ($o^{\leftarrow}$ and $o^{\rightarrow}$). With a similar but mirrored procedure, we obtain a concatenated feature map $Q^{\leftrightarrow}$. Each element $q^{\leftrightarrow}_{k,l}$ represents the features of patch $p_{k,l}$ with contextual information from the whole of $E_i$. In summary, the context aggregation module sweeps over the feature maps $E_i$ vertically and horizontally, providing relevant global information.
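The four-direction sweep can be sketched with a toy recurrent unit. This is an illustration of the sweeping pattern only, not the network used in the paper: the scalar tanh RNN, the fixed input weights, and the recurrent weights in `w` are all assumptions chosen to keep the example short.

```python
import math


def rnn_sweep(seq, w_rec):
    """Toy tanh RNN: one scalar state per step, input weights fixed to 1."""
    state, states = 0.0, []
    for x in seq:  # x is a feature vector (list of floats)
        state = math.tanh(sum(x) + w_rec * state)
        states.append(state)
    return states


def two_way(seq, w_fwd, w_bwd):
    """Sweep a sequence in both directions and concatenate the two projections."""
    fwd = rnn_sweep(seq, w_fwd)
    bwd = rnn_sweep(seq[::-1], w_bwd)[::-1]
    return [[f, b] for f, b in zip(fwd, bwd)]


def aggregate_context(patches, w=(0.5, 0.4, 0.3, 0.2)):
    """patches: K x L grid of patch feature vectors. Returns a K x L grid of 2-d features.

    `w` holds hypothetical recurrent weights for the down, up, right and left RNNs.
    """
    K, L = len(patches), len(patches[0])
    # Vertical sweep (down and up) over every column.
    cols = [two_way([patches[k][l] for k in range(K)], w[0], w[1]) for l in range(L)]
    q_vert = [[cols[l][k] for l in range(L)] for k in range(K)]
    # Horizontal sweep (right and left) over every row of the vertical output.
    return [two_way(q_vert[k], w[2], w[3]) for k in range(K)]
```

Because the horizontal sweep runs on the output of the vertical sweep, changing the top-left patch alters the feature at the bottom-right position, which is the sense in which each output element carries context from the whole feature map.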

3) FEATURE MAP UP-SAMPLING MODULE (F1)
All component feature map up-sampling modules share the same network architecture but have independent weights. Each component segmentation module is built with two 3 × 3 convolutions, each followed by one bilinear up-sampling. For the obtained $N-1$ bounding boxes, $N-1$ light, parallel feature map up-sampling modules predict the masks of the inner facial components. We use the pixel-wise cross-entropy to measure the component segmentation accuracy. The segmentation loss $L_{seg1}$ is defined as the averaged cross-entropy among all the segmentation networks.

C. FACE SEGMENTATION (FS) BRANCH

1) CONTEXT AGGREGATION MODULE (C2)
Different from chain-structured sequential data, the connectivity structure of image units goes beyond a chain. Graphical representations that respect the 2-D neighborhood system, such as undirected cyclic graphs (UCGs), are more plausible solutions for the spatial arrangement of image units. However, due to the loopy structure of UCGs, RNNs cannot be directly applied to UCG-structured images. We therefore decompose the UCG $U$ into a set of complementary directed acyclic graphs (DAGs): $U = \{D_d\}$. As exemplified in Fig. 3, we use the four context propagation directions ($D_1$, $D_2$, $D_3$, and $D_4$) to decompose $U$. The topology of the feature maps $F$ is represented as a DAG $D = \{V, E\}$, where $V = \{v_i\}_{i=1:N}$ is the vertex set and $E = \{e_{ij}\}$ is the arc set. The hidden layer $h_d$ follows the same topology as $D$. Therefore, a forward propagation sequence can be generated by traversing $D$. In the mathematical expression of these operations, $M_d$, $V_d$, and $W_d$ are weight matrices and $b_d$ is a bias vector. Here, $x^{v_i}$, $h^{v_i}$, and $o^{v_i}$ are the representations of the input, hidden, and output layers located at $v_i$, respectively, $P_{D_d}(v_i)$ is the direct predecessor set of vertex $v_i$ in $D_d$, and $g$ and $k$ are composition functions. We place C2 on top of the IFE module to capture the rich contextual dependencies over image regions.
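For reference, one plausible form of these operations, following the DAG-structured RNN formulation that C2 builds on [16], [17], is sketched below. The assignment of roles is an assumption: $M_d$ is taken as the input-to-hidden matrix, $W_d$ as the recurrent matrix, and $V_d$ as the hidden-to-output matrix. The auxiliary variable $\hat{h}_d^{(v_i)}$ and the output bias $c$ are hypothetical symbols introduced only for this sketch.

```latex
\begin{aligned}
\hat{h}_d^{(v_i)} &= \sum_{v_j \in P_{D_d}(v_i)} h_d^{(v_j)}, \\
h_d^{(v_i)} &= g\left(M_d\, x^{(v_i)} + W_d\, \hat{h}_d^{(v_i)} + b_d\right), \\
o^{(v_i)} &= k\left(\sum_{d=1}^{4} V_d\, h_d^{(v_i)} + c\right).
\end{aligned}
```

Under this reading, each directional hidden layer accumulates the states of its direct predecessors, and the output at $v_i$ fuses the four directional hidden states, so every location receives context from the whole image.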

2) FEATURE MAP UP-SAMPLING MODULE (F2)
For the face, we perform several convolutions and up-sampling operations to generate the mask $M_{face}$ ($M^{g}_{face}$ is the ground truth). We also use the cross-entropy loss to constrain the segmentation accuracy, so the segmentation loss $L_{seg2}$ is the cross-entropy between $M_{face}$ and $M^{g}_{face}$. Finally, all the resulting segmentation masks are gathered to form the final face parsing result, denoted as $M$.

Since the component segmentation relies on a good component bounding-box regression, we divide the training process into two steps. In the first step, we train only the IFE module and the component bounding-box prediction module to obtain good component regression accuracy. Here, only $L_{reg}$ is used for training. In the second step, we perform joint training by updating all parameters with $L_{reg}$, $L_{seg1}$, and $L_{seg2}$ together.
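The two-step schedule can be summarized in a few lines. This sketch assumes that the three losses are combined by an unweighted sum in the second step, which the text does not state, and `pixel_cross_entropy` and `total_loss` are hypothetical helper names.

```python
import math


def pixel_cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i] lists the class probabilities of pixel i."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def total_loss(step, l_reg, l_seg1=0.0, l_seg2=0.0):
    """Step 1 uses L_reg only; step 2 adds both segmentation losses (unweighted sum assumed)."""
    if step == 1:
        return l_reg
    return l_reg + l_seg1 + l_seg2
```

Training the regression first means that the segmentation branches see reasonably placed bounding boxes from the start of joint training, instead of learning masks from badly cropped features.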

IV. DATASETS
Datasets with diverse images and well-labeled masks are an important reason for the continuous improvement of face parsing algorithms, especially deep learning-based ones. To the best of our knowledge, there are only a few public face parsing datasets, such as LFW-PL [20] and HELEN [14], in which the hair area is considered an essential semantic category for parsing. In particular, the images in both datasets are taken in uncontrolled environments, and the tongue is not annotated. The lack of accurately annotated datasets is a major obstacle to progress in face parsing for TCMI. To fill this gap, we construct a novel dataset named TCMID, in which the tongue is regarded as one of the most critical semantic categories.

A. DATA COLLECTION
We collect 1500 face images in JPG format. The facial image acquisition system is the same as in [5]. Table 1 shows the composition of our dataset. Each image is provided with an accurate pixel-level label map with 6 categories (face, left eye, right eye, lips, tongue, and background), labeled subjectively by TCM practitioners. The images are split into training and test sets with 1100 and 400 images, respectively.

B. IMAGE DIVERSITY
Face images in our dataset display large variations in (foreground) facial complexion, lip color, eye state, etc. (see Table 2). As demonstrated in Fig. 4, the patient's tongue (substance and coating) color, facial gloss, and other conditions vary greatly. The tongue substance is usually reddish, and the tongue coating is normally white, gray, or yellow. The tongue color gamut overlaps heavily with the lip and face color gamuts. Fig. 5 shows facial images with different head poses (rotation). As demonstrated in Fig. 6(c), in addition to the face and target inner facial components, typical face images inevitably contain many non-target components, such as hair, beard, teeth, and the inner tissue of the mouth. The different states (open, half-open, closed) of the eyes and mouth are shown in Fig. 6(a) and Fig. 6(b), respectively. Moreover, the patient opens the mouth wide with the tongue sticking out, so the lower lip is partially (or totally) occluded by the tongue. We include such large variations in TCMID to make our model more robust to challenging inputs.

C. DATA AUGMENTATION
The following data augmentation operations are used. (1) Horizontal flipping. Images are flipped horizontally with probability 0.5. (2) Gamma adjustment. We apply four different Gamma transforms to increase color variation. The Gamma values are {0.5, 0.8, 1.2, 1.5}. (3) Background replacement. We first use an image matting network [42] to extract the foreground (face) region. Then we randomly replace the background with non-face images [43], [44] or pure colors (e.g., deep red, light red, purple, red, white, yellow, gray).
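The flipping and Gamma operations can be sketched on a plain grayscale image stored as rows of 8-bit intensities. The sketch makes two assumptions: the Gamma convention `out = 255 * (in / 255) ** gamma`, and that the (possibly flipped) image is kept alongside its four Gamma variants. Background replacement is omitted because it depends on the matting network [42].

```python
import random

GAMMAS = (0.5, 0.8, 1.2, 1.5)  # values listed in the text


def hflip(image):
    """Mirror an image given as rows of pixels."""
    return [row[::-1] for row in image]


def gamma_adjust(image, gamma):
    """Apply out = 255 * (in / 255) ** gamma to 8-bit intensities (convention assumed)."""
    return [[round(255 * (px / 255) ** gamma) for px in row] for row in image]


def augment(image, rng):
    """Flip with probability 0.5, then return the image plus its four Gamma variants."""
    if rng.random() < 0.5:
        image = hflip(image)
    return [image] + [gamma_adjust(image, g) for g in GAMMAS]
```

A call such as `augment(img, random.Random(0))` yields five images per input. When the same flip is applied to a label map, the left-eye and right-eye labels would presumably need to be swapped; the text does not say how this is handled.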

D. OTHER DATASETS
Furthermore, we manually relabel some images of the HELEN and LFW-PL datasets as challenging cases for testing. It is worth noting that only a small number of face images in these two datasets conform to the TCMI face image standard: the patient opens the mouth wide with the tongue sticking out. We selected the face images that meet this standard in HELEN and LFW-PL (see Table 3).

V. EXPERIMENTS
In this section, ablation studies and exploratory experiments on TCMID are carried out to examine the hybrid network structure and several important modules of the proposed architecture. We then test our network on the HELEN, LFW-PL, and TCMID datasets. Experimental results show that our model achieves the best results among state-of-the-art methods on all three datasets.

A. PERFORMANCE EVALUATION METRICS
Similar to [11]-[14], we use the F-measure for each class as the basic evaluation metric. In addition, we quantitatively compare our model with existing face parsing and semantic segmentation methods using Accuracy, Precision, Recall, F-measure, and their corresponding standard deviations. The metrics are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP denotes the number of true positive pixels, TN the number of true negative pixels, FP the number of false positive pixels, and FN the number of false negative pixels. The F-measure is the harmonic mean of Precision and Recall.
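As a concrete check of these metrics, the sketch below computes them from pixel counts for a single class. The helper name `binary_metrics` and the convention of returning 0 when a denominator is empty are assumptions made for this example.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall and F-measure from the pixel counts of one class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure
```

For a small class such as an eye, TN dominates the counts, so Accuracy stays high even when many eye pixels are missed. This is one reason the per-class F-measure is the more informative score for the inner facial components.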

B. ABLATION STUDY
We quantitatively evaluate and compare our ablation models using Accuracy and F-measure metrics on TCMID dataset. The performances are reported in the form of F-measure for each class, mean F-measure over the five foreground categories (le, re, lips, tongue, face), and average Accuracy (Acc.). Herein, le is short for the left eye, re is short for the right eye, bg is short for the background.
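The mean F-measure over the five foreground categories can be sketched directly on label sequences. The string labels reuse the abbreviations above, and scoring a class that is absent from both prediction and ground truth as 1.0 is an assumption of this example rather than a rule stated in the paper.

```python
FOREGROUND = ("le", "re", "lips", "tongue", "face")


def class_f_measure(pred, truth, label):
    """F-measure of one class, computed as 2TP / (2TP + FP + FN) over flat label sequences."""
    tp = sum(p == label and t == label for p, t in zip(pred, truth))
    fp = sum(p == label and t != label for p, t in zip(pred, truth))
    fn = sum(p != label and t == label for p, t in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0


def mean_foreground_f(pred, truth):
    """Average the per-class F-measure over the five foreground categories (bg excluded)."""
    return sum(class_f_measure(pred, truth, c) for c in FOREGROUND) / len(FOREGROUND)
```

Excluding `bg` from the mean keeps the large, easy background region from masking errors on the small components.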

1) IMPORTANCE OF THE HYBRID NETWORK STRUCTURE
We use a hybrid (local-based) strategy to train separate branches for the face and the detailed inner facial components. As explained in Section II.C, global-based methods directly predict the per-pixel semantic label over the whole face image. Table 4 illustrates the advantages of hybrid structures over global-based structures in F-measure and average Accuracy. Experimentally, the accuracy of global-based methods [11], [14], [31] is limited because they do not focus on each individual part.

C. COMPARISON WITH STATE-OF-THE-ART METHODS
We perform a thorough comparison between our model and existing state-of-the-art (face parsing and semantic segmentation) methods on LFW-PL, HELEN, and TCMID. In our task, the foreground regions (le, re, lips, tongue, face) are much more important than the background region, so we calculate the mean F-measure over the five foreground categories. As Table 10 shows, our model achieves the best F-measure among state-of-the-art (face parsing and semantic segmentation) methods on the three datasets (all categories). The average Accuracy, Precision, Recall, and F-measure, together with their corresponding standard deviations, for all the methods on the three datasets are displayed in Table 11.

1) COMPARISON WITH FACE PARSING METHODS
As mentioned before, existing face parsing methods [11]-[14], [31] mainly focus on segmenting hair, eyebrows, and other facial components that are rarely relevant to TCMI, rather than segmenting the tongue required for TCMI applications. The F-measure scores of the six categories and the mean F-measures over the five foreground categories on the three test datasets are presented in Table 10. Table 11 provides further performance comparisons between the proposed method and other existing face parsing approaches. As observed, our method obtains the best result on the three test datasets in terms of Accuracy, Precision, Recall, and F-measure, which demonstrates the effectiveness of the proposed TCMINet.

1) LFW-PL AND HELEN
We evaluate our approach on the LFW-PL and HELEN datasets. Experimentally, our model shows good generalization ability on these two challenging datasets (see Table 10). Fig. 9 and Fig. 10 show the qualitative parsing results on the LFW-PL and HELEN datasets, respectively. The ground-truth label maps are also shown.

2) TCMID
Our model is robust to challenging inputs. As shown in Fig. 11 and Fig. 12, the proposed TCMINet is suitable for segmenting the face and inner facial components with varying appearances (e.g., tongue substance color, tongue coating color, lip color, facial gloss, and face color) or states (e.g., head rotation, mouth state, eye state, and interference).

A. SIMULTANEOUS SEGMENTATION OF MULTIPLE FACE ORGANS
Facial medical analysis is a non-invasive, non-contact diagnostic method of TCM. Generally, segmenting the facial skin and facial sensory organs from face images is the first step in computer-aided facial medical analysis. A large body of research focuses on detecting and segmenting a single face organ or the facial skin. For instance, Pang et al. [46] proposed the Bi-Elliptical Deformable Contour (BEDC) model for automated tongue area segmentation. In our previous work [45], we proposed a real-time tongue image segmentation method for remote diagnosis (see Fig. 13(b)). Zhao et al. [2] developed a facial region segmentation method that partitions the facial skin into five specific regions (see Fig. 13(c)). In [48], four facial skin blocks in TCM are automatically extracted from each half-face image (see Fig. 13(d)). In [47], a cheek region extraction method was proposed for face diagnosis (see Fig. 13(e)). These methods mainly explore the role of facial skin regions in reflecting the health status of patients while ignoring the importance of the inner facial components (e.g., eyes, lips, tongue). Our work is the first attempt to deal with the face and multiple inner facial components simultaneously, based on the principles of the TCM holistic view (see Fig. 13(f)).

B. FACE PARSING FOR TCMI
As mentioned in Section I, previous face parsing methods mainly focus on segmenting hair, eyebrows, and other facial components that are rarely relevant to TCMI. In TCMI, the human face and facial sensory components are believed to reveal signs of various constitutions. Among them, the tongue, as the primary organ of gustation, conveys abundant valuable information about diseases of the internal body. In our work, face parsing for TCMI amounts to labeling each pixel as left eye, right eye, lips, tongue, face, or background, following the principles of the TCM holistic view. Experimentally, our proposed TCMINet outperforms state-of-the-art methods on the LFW-PL, HELEN, and TCMID datasets under different evaluation metrics.

C. LIMITATIONS AND FUTURE WORKS
(1) In-the-wild and multi-face conditions: as shown in Table 10 and Table 11, our model achieves better performance on TCMID than on LFW-PL and HELEN. Although our proposed model is suitable for segmenting faces and inner facial components with varying appearances or states, it cannot deal with multiple faces in field conditions. In future work, we will extend our architecture to handle different face instances under various environments.
(2) Multiple specific facial regions: our proposed model achieves simultaneous segmentation of the face and inner facial components. However, based on the principles of TCM, the human face can be roughly partitioned into multiple specific regions by connecting specific landmarks, and different regions can reflect the health status of different internal organs. In the future, we plan to explore a multi-task learning architecture to partition the face into these specific regions.

VII. CONCLUSION
In this paper, we propose an effective hybrid network with context aggregation for face parsing in TCMI. Ablation studies show the effectiveness of our hybrid structure and its important modules. The superior performance on LFW-PL, HELEN, and the proposed TCMID dataset shows the ability of the proposed TCMINet to handle the problem of face parsing for TCMI. Most importantly, our TCMINet can handle faces and all the inner facial components with various appearances (e.g., color) and states (e.g., head rotation), providing new insights into TCMI research and development.