Patch-Driven Tongue Image Segmentation Using Sparse Representation

Tongue diagnosis plays a key role in Traditional Chinese Medicine (TCM). Tongue image segmentation lays a solid foundation for quantitative tongue analysis and diagnosis. However, segmenting the tongue body is challenging due to factors such as large personal variation in the color, texture, and shape of the tongue body, as well as weak edges caused by the similar color of the tongue body and neighboring tissues, especially the lip. Existing segmentation methods usually use only a single color component and simple prior knowledge, leading to inaccuracy and instability. To alleviate these issues, this paper proposes a patch-driven segmentation method based on sparse representation. Specifically, each patch in the testing image is sparsely represented by patches in a spatially varying dictionary constructed from local patches of the training images. The derived sparse coefficients are then employed to estimate the tongue probability. Finally, the hard segmentation is obtained by applying the maximum a posteriori (MAP) rule to the tongue probability map and is further polished with morphological operations. The proposed method has been extensively evaluated on a tongue image dataset of 290 subjects using 10-fold cross-validation, as well as on 10 additional unseen testing subjects, and achieves more accurate segmentation results than state-of-the-art methods.


I. INTRODUCTION
Tongue diagnosis [1], [2] aims to draw physiological and pathological cues from the tongue body's features, including color, texture, shape, coating, and so on. Tongue diagnosis has the virtues of effectiveness, painlessness, absence of side effects, simplicity, and immediacy, which make it one of the most important and widely used diagnostic methods in TCM. Over more than 3000 years of practice, TCM practitioners have accumulated rich clinical experience, such as the eight classic tongue diagnosis principles [2], revealing strong relationships between sub-regions of the tongue body and the health status of human organs. For example, tongue coating is usually strongly related to the health status of the stomach. In addition, the tongue's appearance can be used to monitor a patient's health status. In detail, the color and texture of tongue coating often reflect diseases and health conditions such as inflammation, infection, and stress [3]. Researchers in TCM tongue diagnosis have also explored its value in diagnosing diseases such as cancer [4], gastritis, and precancerous lesions [5].
(The associate editor coordinating the review of this manuscript and approving it for publication was Leyi Wei.)
In modern Western medicine, some literature [6], [7] refers to tongues with discolored regions or cracks as ''geographic tongue'' and has validated the value of tongue diagnosis for clinical decision making [7]. For example, tongue coating exhibits a strong relationship with viable salivary bacteria [8], which indicates a risk of aspiration pneumonia in edentate patients. Amyloidosis of the tongue may be used to diagnose plasmacytoma [9]. The research in [10] explored the connection between tongue diagnosis and the tongue-coating microbiome, systematically investigating the biological bases of different tongue-coating appearances at the molecular level. These examples exhibit the potential of tongue diagnosis for inferring systemic disorders.
However, the accuracy of conventional tongue diagnosis depends highly on the practitioner's experience, and the diagnosis is subjective and time-consuming. To address these issues, image processing and pattern recognition techniques have been used to achieve automatic tongue diagnosis [11]. This type of automatic diagnosis usually first uses image segmentation to extract the tongue body from a tongue image, then uses feature extraction to compute tongue body features, and finally applies a classifier to reach the final diagnosis. Hence, tongue image segmentation is a crucial step in automatic tongue diagnosis. Several works [12]-[16] have addressed this problem. However, tongue image segmentation remains challenging due to the large personal variation in tongue body appearance and weak edges mainly caused by the similar color of the tongue body and lip.
In recent years, patch-based methods have drawn increasing attention in computer vision and image processing, for tasks such as label fusion and segmentation [17]-[20], texture synthesis [21], in-painting [22], image denoising [23], [24], and super-resolution reconstruction [25]. For example, Coupé et al. [17] presented a patch-based label fusion method, termed the nonlocal-means method, for segmentation of the hippocampus and ventricle. This patch-based method is widely used due to its simplicity and good segmentation performance. Recently, patch-based sparse representation has also attracted increasing interest in image processing. These methods assume that each patch (i.e., local window) in an image can be sparsely represented by a linear combination of image patches in an over-complete dictionary [26]-[32]. Motivated by the nonlocal-means method [17] and sparse representation techniques, we propose a novel patch-based sparse representation method for tongue image segmentation. The differences between the nonlocal-means method [17] and our proposed method include two aspects: 1) the nonlocal-means method [17] used a similarity measure in grayscale space to pre-select the training patches, whereas the proposed method extends this similarity measure from grayscale space to RGB color space.
2) The nonlocal-means method took the weighted average of the training pixels' labels as the probability of the testing pixel belonging to the object (tongue body), where the weight of a training pixel is defined as the similarity between its corresponding training patch and the testing patch. In contrast, the proposed method uses the sparse coefficients obtained by sparse representation to calculate this probability. In summary, in the nonlocal-means method, all training patches remaining after pre-selection contribute to the probability estimate, regardless of their similarity to the testing patch, whereas in the proposed method only the training patches with non-zero sparse coefficients in the dictionary contribute. Specifically, the proposed method first uses the pre-selected local patches of training images (through similarity-based patch pre-selection) to construct a dictionary. Then, for each pixel in a testing image, its corresponding patch is sparsely represented by the patches in the dictionary, and this patch is assumed to share the segmentation labels of the selected dictionary patches according to their respective sparse coefficients. By visiting each pixel in the testing image and computing the probability of the corresponding image patch, a tongue probability map is generated for the entire testing image. Lastly, the final hard segmentation is obtained by applying the maximum a posteriori (MAP) rule to the tongue probability map, followed by polishing with morphological operations. Experimental results on 290 subjects and 10 additional unseen subjects show that the proposed method dramatically improves both segmentation accuracy and robustness.
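The weighted-average estimator used by the nonlocal-means baseline can be sketched in a few lines; the Gaussian weight kernel and the bandwidth h below are illustrative assumptions, not the exact weighting of [17]:

```python
import numpy as np

def nonlocal_means_probability(test_patch, train_patches, train_labels, h=0.5):
    """Estimate the tongue probability of a testing pixel by weighted label
    fusion over candidate training patches (sketch of the nonlocal-means
    idea; the exponential kernel and bandwidth h are assumptions)."""
    # Squared patch distances between the testing patch and each training patch.
    d2 = np.sum((train_patches - test_patch) ** 2, axis=1)
    w = np.exp(-d2 / (h ** 2))  # every surviving patch gets a positive weight
    return float(np.sum(w * train_labels) / np.sum(w))
```

Note that every candidate patch receives a non-zero weight here, which is precisely the behavior the sparse-coefficient estimator of the proposed method avoids.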
The rest of this paper is organized as follows. Related work is introduced in Section II. The theory and implementation of the proposed method are described in Section III. Experimental results are reported in Section IV. Conclusions are drawn in Section V.

II. RELATED WORK
Due to the above challenges of tongue image segmentation, conventional low-level image processing techniques such as edge detection fail to produce satisfactory segmentation results. To improve accuracy, several hybrid tongue image segmentation methods have been developed, among which ACM (Active Contour Model) based methods are popular. An ACM, also called a snake [33], evolves a given initial curve toward the true object contour under the control of internal and external forces.
ACM-based methods mainly focus on initial contour determination and curve evolution. For example, Pang et al. [13] presented a bi-elliptical deformable contour extraction method, termed BEDC. The method first defines a bi-elliptical deformable template (BEDT) as a rough description of the tongue body, then obtains the initial tongue contour by minimizing the BEDT energy function, and finally evolves the initial contour with a modified active contour model, which replaces the conventional internal force with a template force, to obtain the final segmentation. BEDC [13] has two limitations. First, it uses only the red component of a tongue image. As pointed out in [15], color is usually the most important feature for distinguishing the tongue body from the background, but a single color component cannot adequately exploit color information, easily causing confusion between the tongue body and its neighboring tissues; this further increases the difficulty of segmenting weak edges. Second, BEDC models the tongue shape simply with two semi-ellipses. Given this rough modeling and the large variation of tongue shapes, the initial tongue contour may capture undesirable strong edges from neighboring tissues and thus converge far from the true tongue contour.
Zhang et al. [14] combined ACM and polar edge detection for tongue image segmentation. Specifically, the method first detects tongue body boundaries via polar edge detection and removes fake boundaries via an edge mask. It then performs local adaptive bi-thresholding and morphological filtering to convert the edge filtering result into a bi-level polar edge image. Finally, it obtains the initial tongue contour heuristically and refines it via ACM. However, the method fails to resolve two main challenges in tongue image segmentation: 1) extracting weak tongue contours caused by color similarity between the tongue body and its surroundings, and 2) distinguishing edges belonging to the tongue contour from other edges caused by strong textures or tongue coating. To remove non-tongue edges, the method [14] uses the Sobel operator, Gaussian filtering, image thresholding, and morphological operations, but this scheme is ineffective for long non-tongue edges. In addition, Gaussian filtering often weakens the true tongue contour, making the first challenge even harder to solve.
Ning et al. [15] used gradient vector flow (GVF), region merging (RM), and ACM to extract the tongue body; the method is briefly termed GVF-RM. Specifically, it first improves GVF to suppress noise and trivial image structures, then obtains many small image regions via watershed segmentation, and finally performs region merging to construct the initial tongue contour for subsequent ACM-based refinement. GVF-RM [15] has three limitations. First, it uses only the red component when obtaining the initial tongue contour. Second, when determining the initial contour via region merging, it obtains the object (tongue body) and background region markers under the assumption that the tongue body lies at the image center and the background at the image border. However, when the tongue body is very close to the image border, a wrong background marker can invalidate this assumption, easily producing poor segmentation results. Third, improving GVF to suppress noise and trivial structures also weakens the true tongue contour.
Furthermore, the above ACM-based methods are usually sensitive to the initial object contour. Once the initial contour includes fake strong edges from neighboring tissues such as the neck, cheek, face, wrinkles, or lip, it can hardly converge to the true boundary of the tongue body.

III. METHOD
To address the limitations of existing tongue image segmentation methods, and inspired by patch-based sparse representation, this paper proposes a novel patch-driven method; Fig. 1 shows its flowchart. The contributions of our proposed method are as follows: (1) We introduce patch-based sparse representation into tongue image segmentation. The proposed method adequately exploits prior information from training tongue images with manual segmentations, thus dramatically improving both segmentation accuracy and robustness.
(2) We develop a novel color-based criterion to evaluate the similarity between a testing patch and a training patch, and use it to pre-select training patches, which reduces the size of the dictionary in sparse representation and saves computational time.

A. DICTIONARY CONSTRUCTION
Our proposed method casts tongue image segmentation as a pixel classification task. Specifically, it uses patch-based sparse representation to calculate the probability of each testing pixel belonging to the object (tongue body), and then obtains the final segmentation via subsequent morphological operations. When conducting patch-based sparse representation, the size of the over-complete dictionary, constructed from training image patches, strongly affects the computational time. Moreover, including confusing patches in the dictionary can degrade the sparse representation result and eventually the segmentation result. To address this, we design a new similarity measure to pre-select training patches for each testing patch. The new measure extends the conventional similarity measure [17] (evolved from the well-known structural similarity measure (SSIM) [34]) from grayscale space to color space. Our new similarity measure is defined as

SIM(S_x, P_y) = (1/3) Σ_{c∈{R,G,B}} [2 μ_c(S_x) μ_c(P_y) / (μ_c(S_x)² + μ_c(P_y)²)] · [2 σ_c(S_x) σ_c(P_y) / (σ_c(S_x)² + σ_c(P_y)²)]   (1)

where S_x and P_y denote the testing patch and the training patch centered at pixels x and y, respectively; c indexes the RGB color components, each with 8-bit depth and intensity range [0, 255]; and μ and σ represent the mean and standard deviation of the patch in the corresponding color component. A high SIM value means high similarity between S_x and P_y. This measure is also used to pre-select training patches, with the selection criterion formulated as

δ(x, y) = 1 if SIM(S_x, P_y) > th, and δ(x, y) = 0 otherwise,   (2)

where 1 means the training patch P_y has high similarity with the testing patch S_x and should be included in the dictionary for sparse representation, while 0 indicates low similarity between P_y and S_x, so P_y should be excluded.
The parameter th was set to 0.95 for all the experiments according to the literature [17]. Note that, to avoid repeated computations and save computational time, both mean and variance of each patch can be pre-calculated as maps of local means and local variances.
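The similarity computation and pre-selection above can be sketched as follows; the channel-averaged form of the measure and the small eps guard against flat patches are illustrative assumptions:

```python
import numpy as np

def sim_rgb(Sx, Py):
    """Structural-similarity-style measure over RGB patches (Eq. (1) sketch).
    Sx, Py: (w, w, 3) patches; averaging over channels is an assumption."""
    s = 0.0
    eps = 1e-8  # guard against zero-mean / flat patches (assumption)
    for c in range(3):
        mu_s, mu_p = Sx[..., c].mean(), Py[..., c].mean()
        sd_s, sd_p = Sx[..., c].std(), Py[..., c].std()
        s += (2 * mu_s * mu_p / (mu_s**2 + mu_p**2 + eps)) * \
             (2 * sd_s * sd_p / (sd_s**2 + sd_p**2 + eps))
    return s / 3.0

def preselect(Sx, Py, th=0.95):
    """Eq. (2): keep the training patch only when SIM exceeds th."""
    return 1 if sim_rgb(Sx, Py) > th else 0
```

As in the text, the per-patch means and standard deviations can be pre-computed once per image as local-mean and local-variance maps to avoid repeated work.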
To construct a patch dictionary for each pixel x in a testing image I, the color patch centered at x (taken from its w × w neighborhood over the RGB color components) is first columnized into a 3w²-dimensional vector and then normalized to unit L2 norm [30], denoted S_x. The patch dictionary for x is adaptively built from all training images with manual segmentation maps as follows. First, letting Ω_i(x) be the w_l × w_l neighborhood of pixel x in the i-th training image, we obtain the patch for each pixel y ∈ Ω_i(x) and pre-select patches from the w_l × w_l neighborhoods of all training images according to the criterion in Eq. (2). Then, each selected training patch is columnized and normalized into a 3w²-dimensional column vector P_y with unit L2 norm. Finally, the normalized column vectors of all selected training patches are stacked into a 3w² × N matrix, the dictionary D_x for sparse representation, where N is the total number of selected training patches. For clarity, Fig. 2 shows the flowchart of dictionary construction, where x indicates the testing pixel, yellow squares in the training images are the search neighborhoods centered at the location of x, and squares of other colors are training patches.
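The columnize-normalize-stack procedure can be sketched as below; this minimal illustration assumes the candidate patches have already passed pre-selection:

```python
import numpy as np

def columnize(patch):
    """Flatten a (w, w, 3) color patch into a 3*w*w vector with unit L2 norm."""
    v = patch.astype(float).ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def build_dictionary(selected_patches):
    """Stack the normalized columns of the pre-selected training patches into
    the 3w^2 x N dictionary D_x (sketch; pre-selection happens beforehand)."""
    return np.stack([columnize(p) for p in selected_patches], axis=1)
```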

B. SPARSE REPRESENTATION
To represent the testing patch S_x with the dictionary D_x, the sparse coefficient vector α is estimated by minimizing the following non-negative Elastic-Net problem [35]:

min_{α ≥ 0} (1/2) ||S_x − D_x α||_2^2 + λ_1 ||α||_1 + (λ_2/2) ||α||_2^2   (3)

In Eq. (3), the first term is the data fitting term, the second is the L1 regularization term enforcing sparsity of α, and the last is the L2 smoothness term enforcing similar coefficients for similar image patches. Eq. (3) is a convex combination of the L1 lasso [36] and the L2 ridge penalty, which encourages a grouping effect while keeping a similar sparsity of representation [35]. Our implementation uses the LARS algorithm [37] in the SPAMS toolbox (http://spamsdevel.gforge.inria.fr) to solve the Elastic-Net problem. Each element α_i of α reflects the similarity between the testing patch S_x and the training patch P_y^i in D_x [38]. Based on the assumption that similar patches share similar segmentation labels, the sparse coefficients in α are used to estimate the probability of pixel x belonging to the object (tongue body):

P(x) = (1/Z) Σ_i α_i L(P_y^i)   (4)

where Z is a normalization constant, and L(P_y^i) is the manual segmentation label of the central pixel of the i-th training patch P_y^i, with 1 for the object label and 0 for the background label.
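The optimization in Eq. (3) and the probability estimate in Eq. (4) can be sketched as follows. The paper solves Eq. (3) with the LARS solver in SPAMS; this illustration substitutes a plain cyclic coordinate-descent solver and takes Z as the coefficient sum, both assumptions for the sketch:

```python
import numpy as np

def nonneg_elastic_net(Dx, Sx, lam1=0.1, lam2=0.01, n_iter=200):
    """Solve min_{a >= 0} 0.5*||Sx - Dx a||^2 + lam1*sum(a) + 0.5*lam2*||a||^2
    by cyclic coordinate descent (a numpy stand-in for the LARS/SPAMS solver)."""
    n_atoms = Dx.shape[1]
    a = np.zeros(n_atoms)
    col_sq = np.sum(Dx ** 2, axis=0)   # squared column norms
    r = Sx - Dx @ a                     # current residual
    for _ in range(n_iter):
        for j in range(n_atoms):
            r += Dx[:, j] * a[j]        # remove atom j's contribution
            rho = Dx[:, j] @ r
            a[j] = max(0.0, (rho - lam1) / (col_sq[j] + lam2))
            r -= Dx[:, j] * a[j]
    return a

def tongue_probability(alpha, labels):
    """Eq. (4): coefficient-weighted vote of the selected patches' labels,
    with the normalization constant Z taken as sum(alpha) (assumption)."""
    z = alpha.sum()
    return float((alpha * labels).sum() / z) if z > 0 else 0.0
```

Only atoms with non-zero coefficients contribute to the vote, which is the key difference from the nonlocal-means weighting.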
After calculating each testing pixel's probability, a probability map is obtained. Fig. 3 shows the probability maps for eight representative tongue images, which exhibit large variation in tongue body shape, size, color, texture, and coating. For example, Figs. 3(a)-(b) show representative examples with large tongue shape variation. As Fig. 3 shows, the proposed method generates accurate probability maps. To convert a probability map into a hard segmentation, the label of pixel x is determined by the maximum a posteriori (MAP) rule. As shown in Fig. 4, the hard segmentation (Fig. 4(c)) is converted from the estimated probability map (Fig. 4(b)). However, some artifacts may remain, such as isolated false objects and holes, indicated by the blue and red arrows in Fig. 4(c).
These artifacts can be easily corrected by morphological operations: isolated false objects are removed by area opening, while holes are filled by morphological closing. The final segmentation result after these operations is shown in Fig. 4(d).
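The MAP thresholding and morphological cleanup can be sketched with scipy; the 0.5 threshold corresponds to equal class priors, and min_area is an illustrative value not taken from the paper:

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_map, min_area=20):
    """MAP rule (threshold at 0.5) followed by simple cleanup: drop small
    isolated components (area opening) and fill interior holes."""
    hard = prob_map >= 0.5                 # MAP with equal priors
    labeled, n = ndimage.label(hard)
    for k in range(1, n + 1):              # remove isolated false objects
        if np.sum(labeled == k) < min_area:
            hard[labeled == k] = False
    return ndimage.binary_fill_holes(hard) # fill holes inside the tongue
```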

IV. EXPERIMENTAL RESULTS
To validate the effectiveness of the proposed method, we conducted extensive experiments on a color tongue image dataset of 290 subjects using 10-fold cross-validation, as well as on 10 additional unseen subjects. Segmentation results on 8 representative tongue images are first used for qualitative comparison between the proposed method and state-of-the-art methods, i.e., GVF-RM [15] and the popular patch-based label fusion method nonlocal-means [17]. Then, segmentation results on the entire dataset are used for quantitative comparison with 4 common classification measures [39]-[41]. Finally, segmentation results on the 10 additional unseen images are compared qualitatively and quantitatively.
For GVF-RM, the optimal iteration number for GVF-based image diffusion was determined by the highest average kappa index (KI) value [41] on the entire dataset, and the corresponding results were taken as its final segmentation results for qualitative and quantitative comparison with the other methods. Other parameters of GVF-RM follow the original literature [15]. The nonlocal-means label fusion method [17] shares the same patch pre-selection and 10-fold cross-validation setting as the proposed method.

A. TONGUE SUBJECTS AND DATA ACQUISITION
Tongue images were acquired by a high-quality and consistent tongue imaging system [42], designed by the Hong Kong team of Prof. David Zhang, with the flowchart given in Fig. 5. Four modules, i.e., illuminant, lighting path, imaging camera, and color correction process, are essential for accurate and consistent image acquisition [42]. In our experiments, each image has a size of 160 × 120 and a resolution of 1 × 1 mm². The manual segmentations were performed by the collaborators of the Hong Kong team, i.e., Dr. Naimin Li and his team from Harbin Binghua Hospital.

B. EVALUATION METRICS
To quantitatively evaluate segmentation accuracy, we used 4 classification measures: misclassification error (ME) [39], false positive rate (FPR), false negative rate (FNR) [40], and kappa index (KI) [41], defined as

ME = 1 − (|B_m ∩ B_a| + |F_m ∩ F_a|) / (|B_m| + |F_m|)
FPR = |B_m ∩ F_a| / |B_m|
FNR = |F_m ∩ B_a| / |F_m|
KI = 2 |F_m ∩ F_a| / (|F_m| + |F_a|)

where B_m and F_m denote the background and foreground of the manual segmentation (ground truth), B_a and F_a denote the background and foreground of an automatic algorithm's segmentation, and |·| is the cardinality of a set. All four measures lie between 0 and 1; lower values of the first three measures and a higher KI value indicate better segmentation accuracy.
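The four measures can be computed directly from binary masks, as in this sketch (True marks the tongue foreground):

```python
import numpy as np

def segmentation_metrics(manual, auto):
    """ME, FPR, FNR and KI from boolean masks, following the set-based
    definitions above (manual = ground truth, auto = algorithm output)."""
    Fm, Bm = manual, ~manual
    Fa, Ba = auto, ~auto
    me = 1.0 - (np.sum(Bm & Ba) + np.sum(Fm & Fa)) / (np.sum(Bm) + np.sum(Fm))
    fpr = np.sum(Bm & Fa) / np.sum(Bm)   # background pixels labeled tongue
    fnr = np.sum(Fm & Ba) / np.sum(Fm)   # tongue pixels labeled background
    ki = 2.0 * np.sum(Fm & Fa) / (np.sum(Fm) + np.sum(Fa))
    return me, fpr, fnr, ki
```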

C. IMPACTS OF THE PARAMETERS
The proposed method has four important parameters: w, w_l, λ_1 and λ_2. Parameters w and w_l denote the patch size and the search neighborhood size, respectively; λ_1 and λ_2 are the weights of the L1 regularization and L2 smoothness terms in sparse representation. The four parameters were determined via 10-fold cross-validation in our experiments. To discuss their impacts on segmentation accuracy clearly, we fixed two parameters while varying the other two. Specifically, we first empirically fixed λ_1 and λ_2 in Eq. (3) to 0.1 and 0.01 and evaluated the impacts of w and w_l on segmentation accuracy. Then, guided by those results, we fixed w and w_l to 7 and 19, respectively, and tested the impacts of λ_1 and λ_2.
To observe the impacts of the parameters on segmentation accuracy, we recorded the average kappa index values of all images from all validation sets in the 10-fold cross-validations under different parameter combinations. First, we studied the impacts of patch size (w) and search neighborhood size (w_l), where w is selected from {3, 7, 11, 15, 19} and w_l from {7, 11, 15, 19}. The kappa index results are provided in Fig. 6(a) and Fig. 6(b). Fig. 6(a) shows that the best average kappa index values were obtained with a patch size of 7 × 7 for each fixed w_l; the best values are 0.961, 0.964, 0.967, and 0.969 for w_l = 7, 11, 15, and 19, respectively. The optimal patch size seems to reflect the complexity of the tongue structure: the patch must be large enough to describe the tongue structure, but not so large that localization is lost. Fig. 6(b) shows that the best average kappa index values were obtained with a search neighborhood of 19 × 19 for each fixed w; the best values are 0.966, 0.969, 0.968, 0.966, and 0.964 for w = 3, 7, 11, 15, and 19, respectively. As expected, performance improves as w_l increases.
Compared with λ_1 = 0 and λ_2 = 0 (i.e., no regularization), the accuracies under the other combinations of λ_1 and λ_2 are significantly improved, with the best accuracy obtained at λ_1 = 0.1 and λ_2 = 0.01. This demonstrates the importance of using both the L1 regularization term and the L2 smoothness term in Eq. (3). On the other hand, there is no significant difference among the different non-zero combinations of λ_1 and λ_2, indicating the robustness of the proposed method to their values. Note, however, that the ranges of the four parameters were chosen empirically in our experiments, which could lead to locally optimal results.

D. THE IMPACT OF COLOR SPACE
This section discusses the impact of color space on the segmentation accuracy of the proposed method. We first compared two color spaces, RGB and CIELAB. When applying the proposed method in the LAB color space, we converted each image from RGB to LAB using the built-in function ''rgb2lab'' in MATLAB 2015a, and then ran the experiments on the entire 290-subject dataset with 10-fold cross-validation. The LAB experiments share the same patch pre-selection and 10-fold cross-validation setting as the RGB experiments; the only difference is the color space. The results for the LAB and RGB color spaces are shown in Fig. 7(a). The mean and standard deviation of kappa index (KI) values in LAB and RGB spaces are 0.960 ± 0.036 and 0.972 ± 0.023, respectively. These quantitative results show that the proposed method obtains significantly higher segmentation accuracy (p-value = 7.557 × 10^-5) and stronger stability in RGB color space than in LAB color space. Fig. 7(b) compares the average KI values obtained in grayscale space and RGB color space on the entire dataset with 10-fold cross-validation: 0.934 ± 0.055 and 0.972 ± 0.023, respectively. These quantitative results show that color information indeed improves the segmentation accuracy of the proposed method.

E. QUALITATIVE COMPARISON
To qualitatively compare the three methods, Fig. 8 shows their segmentation results on the eight representative tongue images with large variation in tongue body shape, size, color, texture, and coating. GVF-RM obtains a satisfactory segmentation result only on Fig. 8(e) and misclassifies regions in all other images. Specifically, GVF-RM suffers from heavy misclassification on Fig. 8(a)

F. QUANTITATIVE COMPARISON
To quantitatively compare the segmentation accuracy of GVF-RM, the nonlocal-means method, and the proposed method, we evaluated them on the entire dataset using four measures: ME, FPR, FNR, and KI. Fig. 9 shows box-whisker plots of the four measures. Furthermore, the mean and standard deviation of ME values obtained by the three methods are 0.091 ± 0.091, 0.044 ± 0.035, and 0.012 ± 0.010, respectively; of FPR values, 0.071 ± 0.105, 0.029 ± 0.027, and 0.006 ± 0.008; and of FNR values, 0.153 ± 0.165, 0.086 ± 0.093, and 0.031 ± 0.030. These quantitative results show that the proposed method has the lowest misclassification error and the strongest stability. For kappa index (KI), the three methods achieve segmentation accuracies of 0.806 ± 0.162, 0.899 ± 0.078, and 0.972 ± 0.023, respectively, which again demonstrates the superiority of the proposed method.

FIGURE 8. Visual inspection of segmentation results on the eight representative tongue images. Columns 1-5: original images, manual segmentation results (ground truths), and segmentation results by GVF-RM [15], the nonlocal-means method [17], and the proposed method.

G. RESULTS ON 10 UNSEEN SUBJECTS
Besides the 10-fold cross-validation experiments on the 290-subject dataset, we further validated the proposed method on 10 additional unseen subjects, with their manual segmentations provided by Dr. Naimin Li from Harbin Binghua Hospital. Fig. 10 shows segmentation results on three of these 10 unseen images. As can be observed, the proposed method obtained higher segmentation accuracy than GVF-RM [15] and the nonlocal-means method [17], taking the manual segmentation results as ground truths. The average kappa index values on the 10 subjects obtained by the different methods are shown in Fig. 11, which again demonstrates the advantage of the proposed method.

FIGURE 9. Box-whisker plots of misclassification error (ME), false positive rate (FPR), false negative rate (FNR), and kappa index (KI) values obtained by applying GVF-RM [15], the nonlocal-means method [17] and the proposed method to the entire dataset.

FIGURE 10. Visual inspection of segmentation results on three unseen tongue images. Columns 1-5: original images, manual segmentation results (ground truths), and segmentation results by GVF-RM [15], the nonlocal-means method [17], and the proposed method.

H. TIME COMPLEXITY
The total complexity of the proposed method is about O(n_1 (3w² · w_l² + N_1 · w_l² + T_s + m_1)), where n_1 is the number of pixels in a testing image, w × w is the patch size, w_l × w_l is the search neighborhood size, N_1 is the number of training images, T_s is the time complexity of sparsely representing a testing pixel's local patch with the over-complete dictionary, and m_1 is the number of training patches remaining after pre-selection, with m_1 ≪ N_1 · w_l². To compare the computational efficiency of the three methods, we recorded their average computational times on the entire tongue image dataset with image size 160 × 120. The average computational time of GVF-RM is around 0.2 minutes, while those of the nonlocal-means method and the proposed method are around 3 and 5 minutes, respectively, under the parameter combination w = 3 and w_l = 7, on our Linux server with 1 CPU and 2 GB memory. Under this parameter combination, the proposed method achieves an accuracy of 0.964 ± 0.025 measured by kappa index (KI). Of the proposed method's running time, on average around 2.8 minutes are spent constructing the over-complete dictionary and an additional 2 minutes performing sparse representation; the rest is consumed by the subsequent morphological operations. The proposed method is slower than the other two methods, especially GVF-RM. Fortunately, its running time can be significantly shortened by parallel computing schemes, such as restructuring our MATLAB implementation and decomposing an image segmentation task into multiple sub-image segmentation tasks by decomposing an image into multiple sub-images.
For example, when calculating the probabilities of all testing pixels in a testing image, if we decompose the testing image, each training image, and the corresponding manual segmentation into 16 sub-images, the running time of the proposed method for segmenting one image drops to about 20 seconds. With the same parallel computing scheme, the running time of the nonlocal-means method is about 11 seconds. However, GVF-RM is unsuitable for this decomposition scheme, as it uses whole-image information in its segmentation process.
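The sub-image decomposition can be sketched as below; each tile could then be dispatched to a separate worker. Halo/overlap handling is omitted for brevity, although a per-pixel patch method would in practice need a border of about w_l/2 pixels around each tile:

```python
import numpy as np

def decompose(image, grid=4):
    """Split an image into grid x grid sub-images (the 16-subimage scheme
    in the text corresponds to grid = 4), each tagged with its offset."""
    tiles = []
    H, W = image.shape[:2]
    hs = np.array_split(np.arange(H), grid)
    ws = np.array_split(np.arange(W), grid)
    for hi in hs:
        for wi in ws:
            tiles.append(((hi[0], wi[0]),
                          image[hi[0]:hi[-1] + 1, wi[0]:wi[-1] + 1]))
    return tiles

def stitch(tiles, shape):
    """Reassemble per-tile results into a full-size map."""
    out = np.zeros(shape)
    for (r, c), t in tiles:
        out[r:r + t.shape[0], c:c + t.shape[1]] = t
    return out
```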

V. CONCLUSION
In this paper, we proposed a novel patch-driven tongue image segmentation method based on sparse representation. The method introduces sparse representation into tongue image segmentation and utilizes prior information from training tongue images with manual segmentations. It also presents a new color-based patch pre-selection criterion that reduces the number of training patches during dictionary construction, thus saving computational time in sparse representation. Experimental results on a tongue image dataset of 290 subjects and 10 additional unseen subjects demonstrate a dramatic performance improvement by the proposed method over state-of-the-art methods.