
Cracked Tongue Extraction Model Based on Improved U-Net Method




Abstract:

Tongue diagnosis holds significant importance in Traditional Chinese Medicine (TCM), with cracked tongues serving as a key diagnostic feature. However, the considerable variability in the morphology, depth, and distribution of tongue cracks poses a challenge for accurate extraction. In this paper, a novel deep learning approach is proposed to enhance the decoder of the U-Net model for cracked tongue extraction by incorporating the Hybrid Parallel Attention Mechanism (HPAM). The inclusion of HPAM enables the model to better concentrate on the small-scale feature information of tongue cracks, thereby improving the accuracy of crack segmentation. Experimental results demonstrate the effectiveness of the proposed method across all three tongue crack datasets. The method achieves a MIoU of 69.31% on the open environment dataset, 76.05% MIoU on the non-open environment dataset, and an overall MIoU of 76.92% on the combined dataset. These results signify a significant improvement over existing methods. This study not only offers an effective approach for automating the extraction of cracked tongues but also contributes to the automation and accuracy of tongue diagnosis, thereby benefiting the field of TCM.
The proposed training process diagram for the improved U-net model based on the Hybrid Parallel Attention Mechanism (HPAM).
Published in: IEEE Access ( Volume: 11)
Page(s): 126352 - 126364
Date of Publication: 02 November 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Cracked Tongue, also known as tongue fissures, refers to the notable depressions and elevations present on the surface of the tongue. This distinctive morphology holds significant importance in facial recognition and disease detection [1], [2]. Apart from its widespread application in traditional Chinese medicine diagnosis [3], recent research has increasingly shown that cracked tongue can reflect changes in overall health and serve as an indicator for early disease detection [2]. Hence, the efficient and accurate extraction of cracked tongue information has become a crucial concern.

Previously, some researchers employed traditional methods for extracting images of fissured tongues. Li et al. used a hyperspectral tongue imaging device to capture images of the tongue and then applied a classification algorithm based on Hidden Markov Models, achieving a certain level of effectiveness [4]. With the advancement of deep learning techniques, Convolutional Neural Networks (CNNs) have become the mainstream approach in medical image analysis. Numerous researchers have applied CNNs to the domain of tongue feature extraction and recognition, yielding promising results [5], [6], [7]. For example, Huang et al. applied deep learning methods to construct a tongue segmentation model for segmenting mobile-acquired tongue images in open and complex environments [8]. Ruan et al. constructed an efficient tongue image segmentation model by optimizing the U-net network and designed a new network to specifically handle tongue edge segmentation [9]. Song et al. proposed RAFF-NET for tongue region segmentation [10]. The above-mentioned studies achieved good results but did not investigate tongue fissures. Existing methods that combine deep learning and tongue diagnosis primarily fall into two categories: object detection [11] and instance segmentation. For instance, Hui et al. proposed a weakly supervised method for training a tooth-mark and crack detection model by leveraging fully bounding-box-level annotated and coarse image-level annotated tongue images, achieving an accuracy of 0.865 in crack recognition [12]. Object detection methods have shown a certain effectiveness in tongue diagnosis; however, they fail to obtain precise contour lines for tongue cracks. As a result, some researchers have employed segmentation methods to extract the contours of tongue cracks. For example, Xue et al. used crack and non-crack regions to train AlexNet, extracting deep features of the crack region, and finally performed classification using Support Vector Machines (SVM) [9]. Yan et al. proposed the Segmentation-Based Deep Learning (SBDL) model for cracked tongue image extraction and recognition [10]. Li et al. improved part of the encoder of the U-net architecture by introducing a global convolutional network module to address the encoder's inability to extract relatively abstract high-level semantic features, thereby achieving cracked tongue extraction; however, this model only achieved a MIoU score of 0.473 on the test set [11]. Although Transformer-based models have achieved excellent performance in some application scenarios, they require large training datasets [13], which leads to suboptimal results on small datasets. Moreover, the extraction of tongue cracks is susceptible to environmental interference, making it challenging to extract all the fissures accurately.

Considering the complexity of the backgrounds in patients' uploaded tongue photographs, this paper devises a Hybrid Parallel Attention Mechanism (HPAM) to augment the U-net framework, with the primary objective of accurate tongue fissure extraction. Firstly, to address the indistinct differentiation between the foreground and background of tongue fissures, this paper reconfigures the U-net architecture and introduces the HPAM module to strengthen the model's capacity for capturing fine details of tongue fissures and to intensify its focus on the fissure regions, thereby improving segmentation accuracy. Secondly, data augmentation techniques and regularization strategies are employed to combat the overfitting that often arises when training on small-scale datasets. The proposed model achieves a Mean Intersection over Union (MIoU) of 76.92% on the test set, indicating high accuracy and robustness in the tongue fissure extraction task. This research provides valuable insights for subsequent investigations.

The principal contributions of this paper are outlined as follows:

  1. In the realm of tongue fissure segmentation, this study proposes an enhanced U-net-based algorithm for tongue fissure extraction, effectively overcoming the difficulties and challenges associated with this task.

  2. A hybrid parallel attention mechanism strategy is introduced to facilitate the model in placing greater emphasis on the fissures themselves, thereby reducing interference stemming from complex background and environmental factors present in the tongue fissure images. Consequently, the performance of the model in tongue fissure extraction is substantially improved.

  3. A dataset of tongue fissures is generated. To address overfitting concerns when working with this limited dataset, data augmentation techniques and regularization strategies are employed, significantly bolstering the model’s generalization capabilities.

The structure of this paper is organized as follows: The first section introduces the background and relevant research pertaining to this study, as well as the key contributions made. The second section delineates the data sources and pre-processing methodologies employed, along with the proposed methodology. The third section verifies the efficacy of the proposed method in tongue fissure extraction through experimentation and comparative analysis with alternative approaches. The fourth section discusses the findings of this study and presents future research prospects. Finally, the fifth section concludes this research endeavor.

SECTION II.

Materials and Methods

A. Data Sets and Preprocessing

1) Data Sources

In this paper, we collected 132 images of tongues exhibiting cracks using web crawlers and color images from TCM books [27]. Additionally, we included 200 tongue photos obtained from Guangdong University of Traditional Chinese Medicine, resulting in a comprehensive dataset of 332 images. This dataset was subsequently divided into two subsets: 155 images captured in an open environment and 177 images captured in a non-open environment. The “open environment” subset comprises images that encompass the tongue, along with partial depictions of the head and body, providing visibility of the surroundings. In contrast, the “non-open environment” subset includes images solely focused on the cracked tongue, with minimal inclusion of background elements, thus mitigating interference from extraneous factors.

2) Data Labeling and Analysis

The dataset utilized in this paper adheres to the Pascal VOC2007 standard format, with the image labeling tool Labelimg employed for the annotation process. The labeling procedure is illustrated in Figure 1. Specifically, every tongue crack in an image is outlined with a multi-segment polygon and labeled as a crack.

FIGURE 1. Schematic diagram of dataset labeling.

In the process of tongue crack detection and extraction, the accurate extraction of tongue cracks is challenging due to several factors. These include the large number of tongue cracks, the minimal contrast between the cracks and the background, as well as variations in environmental conditions and filming equipment. For instance, as shown in Figure 2a, the captured image may be affected by varying light intensities, resulting in a darker appearance of the tongue surface. In Figure 2b, the image is influenced by the shooting distance, leading to a smaller tongue target area, albeit with fewer tongue cracks. Figure 2c illustrates an image captured at a close distance with shallow cracks, resulting in limited differentiation between the cracks and the background. Figure 2d demonstrates a scenario where the tongue exhibits a high number of complex cracks. Achieving pixel-level crack extraction using target detection methods becomes challenging in such cases. Therefore, this paper utilizes image segmentation techniques to accomplish crack extraction. The aforementioned characteristics of the image data intensify the challenges associated with tongue crack extraction.

FIGURE 2. Challenges encountered in the tongue crack extraction task.

3) Data Enhancement

Because the original dataset is small, consisting of only 332 cracked tongue images, a dataset of this size can lead to poor model generalization performance. Therefore, this paper randomly selects approximately 30% of the data from both the open environment dataset and the non-open environment dataset as the test set. Six data enhancement methods are then applied to augment the remaining images: image flip (Figure 3a), random rotation (Figure 3b), contrast enhancement (Figure 3c), random color dithering (Figure 3d), brightness enhancement (Figure 3e), and color enhancement (Figure 3f). The enhanced data were then filtered and labeled; after removing images in which texture loss caused by overexposure made the cracks indistinguishable to the naked eye, 759 enhanced images remained in the open environment dataset and 720 in the non-open environment dataset. The two datasets are divided into training and validation sets in the ratio of 8:2. Through the data enhancement process, the new images acquire more comprehensive image features, which in turn improve the training model's performance and yield better results.

FIGURE 3. Image enhancement methods.
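To make the augmentation step concrete, the following is a minimal Pillow-based sketch of the six operations above. The paper does not report exact parameter ranges, so the rotation limits and enhancement factors below are illustrative assumptions.

```python
# Minimal sketch of the six augmentation operations (Figure 3a-f) using Pillow.
# Parameter ranges are assumptions for illustration; the paper does not state them.
import random
from PIL import Image, ImageEnhance

def augment_six(img: Image.Image):
    """Return the six augmented variants of a tongue image."""
    flipped  = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)              # (a) image flip
    rotated  = img.rotate(random.uniform(-30, 30))                         # (b) random rotation
    contrast = ImageEnhance.Contrast(img).enhance(1.5)                     # (c) contrast enhancement
    dithered = ImageEnhance.Color(img).enhance(random.uniform(0.6, 1.4))   # (d) random color dithering
    brighter = ImageEnhance.Brightness(img).enhance(1.3)                   # (e) brightness enhancement
    colored  = ImageEnhance.Color(img).enhance(1.5)                        # (f) color enhancement
    return [flipped, rotated, contrast, dithered, brighter, colored]
```

Note that geometric transforms such as flipping and rotation must be applied identically to the annotation masks, whereas the photometric transforms (c)-(f) apply to the images only.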

B. Model Structure Design and Principles

1) U-Net Model

The U-net derives its name from its network structure, which resembles the shape of the letter "U." It is a convolutional neural network model specifically designed for image segmentation tasks [14]. It demonstrates remarkable suitability for medical image segmentation, particularly when accurate segmentation of small targets such as cells and blood vessels is required. As illustrated in Figure 4, the U-Net model resembles an autoencoder, encompassing both a downsampling path and an upsampling path. The downsampling path comprises convolutional blocks and downsampling blocks, enabling the extraction of global features from the input image, while the upsampling path restores spatial information in the segmentation output. During training, the U-Net employs skip connections, which concatenate the feature maps from the downsampling path with their corresponding counterparts in the upsampling path, thereby achieving precise and detailed segmentation outcomes. The versatility of the U-Net model extends beyond medical image segmentation, holding promising applications in various other domains requiring image segmentation, and its successful deployment yields valuable insights for image segmentation tasks in diverse fields.

FIGURE 4. U-net model structure.
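As a concrete reference for the architecture described above, the following is a compact PyTorch sketch of a U-Net with skip connections. The 64-128-256-512 channel widths follow the common configuration and are assumptions rather than the paper's exact settings.

```python
# Compact U-Net sketch: downsampling path, bottleneck, upsampling path,
# and skip connections that concatenate encoder features into the decoder.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by BatchNorm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2, widths=(64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        ch = in_ch
        for w in widths:                              # downsampling path
            self.encoders.append(conv_block(ch, w))
            ch = w
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(widths[-1], widths[-1] * 2)
        self.ups, self.decoders = nn.ModuleList(), nn.ModuleList()
        ch = widths[-1] * 2
        for w in reversed(widths):                    # upsampling path
            self.ups.append(nn.ConvTranspose2d(ch, w, 2, stride=2))
            self.decoders.append(conv_block(w * 2, w))  # *2 from skip concatenation
            ch = w
        self.head = nn.Conv2d(widths[0], n_classes, 1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                           # save features for skip connections
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([skip, x], dim=1))      # skip connection: concat, then convolve
        return self.head(x)
```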

2) Hybrid Parallel Attention Mechanism

To focus on tongue fissure information in tongue images, a Hybrid Parallel Attention Mechanism (HPAM) is designed in this paper. It is computed by three different attention modules in parallel on three tracks, and the outputs of the three modules are finally summed. HPAM consists of three parallel modules: the SENet module [15], the SAM module [16], and the CAM module [16]. Given an input feature map X\in \mathbb {R}^{H\times W\times C} , where H , W , and C denote the height, width, and number of channels of the feature map, respectively, each of these three modules processes the input feature map to generate a new feature map.

SENet module: The SENet module first obtains a channel descriptor through a global average pooling operation and then rescales the channels through two fully connected layers.

Assume that the weights of the two fully connected layers in the SENet module are W_{1} \in {\rm \mathbb {R}}^{C/r\times C} and W_{2} \in {\rm \mathbb {R}}^{C\times C/r} , where r is the channel reduction ratio. The output of the SENet module can then be expressed as:\begin{equation*} X_{\textrm {SENet}} =F_{\textrm {scale}} \left ({{X,W_{2} \delta \left ({{W_{1} F_{\textrm {avg}} (X)} }\right)} }\right) \tag{1}\end{equation*}
where F_{\textrm {scale}} denotes the channel rescaling function, which multiplies each channel of X by its learned weight, F_{\textrm {avg}} denotes global average pooling, \delta denotes the ReLU activation function, and W_{1} and W_{2} are the weights of the two fully connected layers.

SAM module: The spatial attention mechanism (SAM) obtains a spatial attention map by computing the maximum and average values of the input feature map along the channel dimension; the map is then activated using a sigmoid function. The output of the SAM module can be expressed as:\begin{equation*} X_{\textrm {SAM}} =\sigma \left ({{F_{\max } (X)+F_{\textrm {avg}} (X)} }\right)\cdot X \tag{2}\end{equation*}
where \sigma represents the sigmoid activation function, F_{\max } denotes the maximum operation along the channel dimension, and F_{\textrm {avg}} denotes the average operation along the channel dimension.

CAM module: The channel attention mechanism (CAM) first computes the maximum and average values of the input feature map along the spatial dimensions, yielding a channel attention map that is subsequently activated using a sigmoid function:\begin{equation*} X_{\textrm {CAM}} =\sigma \left ({{F_{\max } \left ({{X'} }\right)+F_{\textrm {avg}} \left ({{X'} }\right)} }\right)\cdot X \tag{3}\end{equation*}
where \sigma is the sigmoid activation function, F_{\max } denotes the maximum operation in the spatial dimension, F_{\textrm {avg}} denotes average pooling in the spatial dimension, and X' is the result of globally pooling X over the spatial dimensions, i.e., X'=\textrm {GlobalPool}(X) .

After these three modules have processed the input feature map, we obtain three new feature maps, denoted X_{\textrm {SENet}} , X_{\textrm {SAM}} , and X_{\textrm {CAM}} . These three feature maps are then summed element-wise to obtain the output feature map X_{\textrm {HPAM}} of HPAM:\begin{equation*} X_{\textrm {HPAM}} =X_{\textrm {SENet}} +X_{\textrm {SAM}} +X_{\textrm {CAM}} \tag{4}\end{equation*}

The structure of HPAM is shown in Figure 5. Introducing SENet makes the network more sensitive to fine features such as tongue cracks, thereby improving the accuracy of crack segmentation. The spatial attention makes the network pay more attention to regions closely related to tongue cracks, avoiding some misclassification problems. The channel attention allows the network to adaptively adjust the weights of different channels, enhancing its ability to discriminate tongue cracks and further improving segmentation accuracy. The ablation experiments in Section III-G also verify this idea.

FIGURE 5. HPAM structure.
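The following PyTorch sketch implements HPAM as defined in Equations (1)-(4): the SENet, SAM, and CAM branches are computed in parallel from the same input and summed element-wise. The reduction ratio r = 16 is an assumption, as the paper does not report its value.

```python
# Sketch of the HPAM module following Eqs. (1)-(4): three parallel attention
# branches (SENet, SAM, CAM) whose outputs are summed element-wise.
import torch
import torch.nn as nn

class HPAM(nn.Module):
    def __init__(self, channels: int, r: int = 16):  # r = 16 is an assumed reduction ratio
        super().__init__()
        # SENet branch: squeeze (global average pool) + two FC layers (Eq. 1)
        self.se = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # --- SENet branch (Eq. 1): per-channel rescaling of X ---
        s = self.se(x.mean(dim=(2, 3)))               # F_avg, then the two FC layers
        x_se = x * s.view(b, c, 1, 1)
        # --- SAM branch (Eq. 2): spatial map from channel-wise max/avg ---
        sp = torch.sigmoid(x.amax(dim=1, keepdim=True) + x.mean(dim=1, keepdim=True))
        x_sam = x * sp                                 # (B,1,H,W) broadcast over channels
        # --- CAM branch (Eq. 3): channel map from spatial max/avg pooling ---
        ch = torch.sigmoid(x.amax(dim=(2, 3), keepdim=True) + x.mean(dim=(2, 3), keepdim=True))
        x_cam = x * ch                                 # (B,C,1,1) broadcast over space
        # --- Eq. (4): element-wise sum of the three branch outputs ---
        return x_se + x_sam + x_cam
```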

3) HAU-Net Model

The U-Net model has found extensive applications in medical image segmentation. However, one limitation is that the decoder component may struggle to accurately reconstruct detailed information. This is due to the loss of spatial information during the downsampling process, making it challenging for the decoder to recover precise details such as edges and contours. This issue becomes particularly prominent in the context of tongue crack extraction, where distinguishing crack edges from the similarly colored tongue body proves difficult.

To address this challenge, this paper proposes the HAU-net, which integrates the Hybrid Parallel Attention Mechanism (HPAM) into the U-Net decoder. As depicted in Figure 6, three HPAM modules are incorporated into the decoder network after the upsampling stage in U-Net. This allows for the fusion of features from different scales in the encoder and extraction of more informative features during the decoder stage using multi-track parallel attention modules.

FIGURE 6. HAU-net model structure.

The HPAM module is not embedded in the encoder because the downsampling operation of the U-Net encoder leads to a loss of detailed information in the original tongue crack image, resulting in poor recognition performance, as demonstrated in the experimental section of Section III in this paper. By incorporating the HPAM modules in the decoder, the HAU-net model can focus more on the relevant tongue crack information within the mixed feature maps of different scales. Consequently, this approach effectively enhances the accuracy and efficiency of the model in detecting and identifying tongue cracks.
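The sketch below shows one HAU-net decoder stage under these assumptions, reusing conv_block and HPAM from the earlier sketches: transposed-convolution upsampling, skip-connection fusion, and then HPAM, matching the placement in Figure 6. Channel sizes are illustrative.

```python
# One decoder stage of HAU-net: upsample, fuse the encoder skip features,
# then apply HPAM (reuses conv_block and HPAM from the sketches above).
import torch
import torch.nn as nn

class HAUDecoderBlock(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = conv_block(out_ch + skip_ch, out_ch)  # fuse multi-scale features
        self.hpam = HPAM(out_ch)                          # attention after upsampling

    def forward(self, x, skip):
        x = self.up(x)                                    # restore spatial resolution
        x = self.conv(torch.cat([skip, x], dim=1))        # skip-connection fusion
        return self.hpam(x)                               # re-weight crack-relevant features
```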

C. Loss Function

The loss function utilized in the tongue crack extraction model consists of two components: the cross-entropy loss and the Dice loss, as represented by Equation (5):\begin{align*} {\mathcal{ L}}_{\textrm {total}} &={\mathcal{ L}}_{\textrm {BCE}} +{\mathcal{ L}}_{\textrm {Dice}} \tag{5}\\ {\mathcal{ L}}_{\textrm {BCE}} &=-\sum \limits _{i} {\left ({{t_{i} \ln \left ({{\hat {t}_{i}} }\right)+\left ({{1-t_{i}} }\right)\ln \left ({{1-\hat {t}_{i}} }\right)} }\right)} \tag{6}\end{align*}
where t_{i} and \hat {t}_{i} denote the ground-truth and the network-predicted crack labels of pixel i , respectively. To deal with the class imbalance problem, this paper also uses the Dice loss [12], which is defined as follows:\begin{equation*} {\mathcal{ L}}_{\textrm {Dice}} =1-\frac {2\cdot \left \langle{ {t_{(h,w)},\hat {t}_{(h,w)}} }\right \rangle +\sigma }{\left \|{ {t_{(h,w)}} }\right \|_{1} +\left \|{ {\hat {t}_{(h,w)}} }\right \|_{1} +\sigma } \tag{7}\end{equation*}
where (h,w) is the pixel coordinate and \sigma is a Laplacian smoothing factor that accelerates the convergence of the network; we set \sigma to 1e-5 in this work.
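A minimal PyTorch rendering of Equations (5)-(7) might look as follows. It assumes raw logits and a binary crack mask; the mean-reduced BCE is used here for numerical convenience, whereas Equation (6) is written as a sum.

```python
# Sketch of the combined loss in Eqs. (5)-(7): cross-entropy plus Dice loss
# with Laplacian smoothing factor sigma = 1e-5.
import torch
import torch.nn.functional as F

def total_loss(logits: torch.Tensor, target: torch.Tensor, sigma: float = 1e-5):
    """logits: raw predictions (B,1,H,W); target: binary crack mask (B,1,H,W), float."""
    bce = F.binary_cross_entropy_with_logits(logits, target)     # Eq. (6), mean-reduced
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))                      # <t, t_hat>
    denom = p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))     # ||t||_1 + ||t_hat||_1
    dice = 1 - (2 * inter + sigma) / (denom + sigma)             # Eq. (7)
    return bce + dice.mean()                                     # Eq. (5)
```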

SECTION III.

Experimental Results and Analysis

A. Experimental Environment and Configuration

The test environment in this paper is: Intel® Core™ i5-6500Q CPU @ 2.30 GHz, GeForce RTX 3090 (24GB) graphics card, Ubuntu 18 OS, Python 3.8.

The experimental configuration parameters are listed in Table 1.

B. Training Process

The experimental training process is illustrated in Figure 7. Initially, the dataset was divided into a test set comprising 30% of the images, while the remaining images were further partitioned into an 80% training set and a 20% validation set after pre-processing and data augmentation. The training and validation sets were utilized for model training, hyperparameter tuning within each epoch, and optimization method adjustment, aiming to achieve optimal results for tongue crack extraction.

FIGURE 7. Training flow chart.

For all models, an input size of 512×512 was utilized, with a batch size of 4. The Adam optimizer was employed, with an initial learning rate of 1e-4 and momentum set to 0.9. The learning rate was adjusted using the cosine annealing algorithm.
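For concreteness, this training configuration corresponds to the following PyTorch setup; the model constructor and the epoch body are placeholders (Adam's beta_1 = 0.9 plays the role of the stated momentum).

```python
# Training setup described above: Adam (lr 1e-4, momentum/beta1 0.9),
# cosine-annealed learning rate, 300 epochs, batch size 4, 512x512 inputs.
import torch

model = UNet()  # placeholder; HAU-net in the paper's experiments
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... train one epoch on 512x512 inputs with batch size 4 ...
    scheduler.step()  # cosine annealing of the learning rate
```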

To assess the model's superiority, a comparison was made with mainstream segmentation algorithms, including U-net_vgg16 with vgg16 as the backbone network, U-net_resnet50 with resnet50 as the backbone network, U-net++ [17], DeeplabV3 [18], Segnet [19], and FRCnet [20]. Initially, each model exhibited a relatively high loss. However, after 300 iterations of network training, all seven models displayed a converging trend on both the training and validation sets, reaching their lowest loss points, as depicted in Figure 8. The figure shows that U-net_resnet50 achieves fast convergence with a training set loss of approximately 0.2; however, its performance on the validation set is less favorable, oscillating around 0.35. On the other hand, the U-net model enhanced with the HPAM module, referred to as HAU-net, exhibits the lowest loss convergence on the validation set. This outcome indicates that HPAM enhances the generalization performance of the original model.

FIGURE 8. Training loss graph.

C. Evaluation Indicators

In order to verify the performance of the model for tongue crack extraction, MIoU, Recall, Precision, Accuracy, and the Dice coefficient are used as evaluation indexes for segmentation performance in this paper. The calculation equations are as follows:\begin{align*} \textrm {Acc}&=\frac {\textrm {TP}+\textrm {TN}}{\textrm {TP}+\textrm {TN}+\textrm {FP}+\textrm {FN}} \tag{8}\\ \textrm {Recall}&=\frac {\textrm {TP}}{\textrm {TP}+\textrm {FN}} \tag{9}\\ \textrm {IoU}&=\frac {\textrm {TP}}{\textrm {FN}+\textrm {FP}+\textrm {TP}} \tag{10}\\ \textrm {MIoU}&=\frac {1}{N}\sum \limits _{i=1}^{N} {\textrm {IoU}_{i}} \tag{11}\\ \textrm {Dice}&=\frac {2\times \textrm {TP}}{2\times \textrm {TP}+\textrm {FP}+\textrm {FN}} \tag{12}\end{align*}

In these equations, True Positive (TP) is the number of crack pixels correctly predicted as cracks; False Positive (FP) is the number of non-crack pixels incorrectly predicted as cracks; True Negative (TN) is the number of non-crack pixels correctly predicted as non-cracks; and False Negative (FN) is the number of crack pixels incorrectly predicted as non-cracks.

IoU is a commonly used evaluation metric that measures the degree of overlap between the predicted segmentation and the true segmentation; it is defined as the ratio of the intersection area to the union area of the predicted and true segmentation results, as shown in Equation (10). MIoU measures the average IoU of the model over multiple categories: it is the mean of the per-category IoU values, as shown in Equation (11), where N denotes the number of categories and IoU_i denotes the IoU of category i .
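For the two-class (crack vs. background) setting used here, the metrics in Equations (8)-(12) can be computed directly from the pixel-level confusion counts, as in this short NumPy sketch.

```python
# Evaluation metrics of Eqs. (8)-(12) from pixel-level confusion counts,
# for the binary crack/background setting (N = 2 categories).
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: binary arrays of identical shape (1 = crack pixel)."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    tn = np.sum((pred == 0) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    iou_crack = tp / (tp + fp + fn)                  # Eq. (10), crack class
    iou_bg = tn / (tn + fp + fn)                     # IoU of the background class
    return {
        "Acc":    (tp + tn) / (tp + tn + fp + fn),   # Eq. (8)
        "Recall": tp / (tp + fn),                    # Eq. (9)
        "MIoU":   (iou_crack + iou_bg) / 2,          # Eq. (11)
        "Dice":   2 * tp / (2 * tp + fp + fn),       # Eq. (12)
    }
```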

D. Performance Comparison of the Models Before and After Data Augmentation

To examine the impact of data augmentation in preprocessing on model performance, this study conducted comparative experiments on U-net_vgg16, HAU-net, U-net++, DeeplabV3, Segnet, and FRCnet before and after data augmentation on the test dataset, as shown in Table 2. All models show varying degrees of performance improvement after data augmentation is used in preprocessing. The most significant improvement is observed for the U-net++ model, with a 24.75-point increase in the MIoU metric. Augmented models exhibit better robustness when facing changes in illumination, noise, or shooting angle in the input data. Data augmentation helps the model adapt to these variations, thereby enhancing its robustness. By introducing techniques such as random rotation and random color jitter, the model can learn more diverse patterns, making it more stable for the task of tongue crack extraction under different conditions.

TABLE 1. Experimental Environment Configuration Parameters.
TABLE 2. Performance of the Models Before and After Data Augmentation.

E. Performance Comparison of Different Models Under Different Data Sets

To verify the performance of the model in this paper on the overall dataset containing all images, the open environment dataset, and the non-open environment dataset, HAU-net is tested against U-net, U-net++, DeeplabV3, Segnet, and FRCnet on the test set in this paper, respectively.

1) Results on the Overall Dataset

The experimental results are shown in Table 3 and indicate that the HAU-net model proposed in this paper shows the best overall performance on these metrics. Specifically, it achieves 76.92 on MIoU, significantly better than the other models. On the Dice coefficient, HAU-net reaches 0.810, the highest among all models. On the Recall metric, HAU-net (0.847) is second only to the U-net++ model. Finally, HAU-net achieves 97.76 in Accuracy, the highest score among all models. Table 3 also reports the Hausdorff Distance (HD), which measures the dissimilarity between the predicted and ground-truth masks.

TABLE 3. Performance of Different Models on the Overall Dataset.

U-net also performs quite well with VGG16 and ResNet50 as backbones, reaching MIoU values of 75.19 and 75.07, respectively. This is why we chose U-net as the basis for improvement in this paper. Other models, such as U-net++, DeeplabV3, Segnet, and FRCnet, are competitive in some metrics, but in general their performance falls short of the HAU-net model.

2) Results on the Open Environment Dataset

The test results for the open environment are shown in Table 4. All seven models tested in this paper show significant degradation in several evaluation metrics due to the interference of varied backgrounds in the open environment. Because HPAM can weight the targets to be segmented in both space and channel, it mitigates background interference, so HAU-net still achieves the best results in several metrics on the open environment dataset. It reaches 69.31 on MIoU, significantly better than the other models. On Recall, HAU-net achieves 80.87, the highest among all models. In terms of the Dice coefficient, HAU-net also outperforms the other models (0.806). In terms of HD, HAU-net scores 6.22, lower (and thus better) than the other models.

TABLE 4. Performance of Different Models Under Open Environment Datasets.

3) Results on the Non-Open Environment Dataset

The test results for the non-open environment dataset are shown in Table 5 and indicate that the HAU-net model again shows the best overall performance. Specifically, it achieves 76.05 on MIoU, significantly better than the other models. On Recall, HAU-net reaches 88.14, the highest among all models. In terms of the Dice coefficient, HAU-net (0.847) is second only to the ResNet50-based U-net model. Finally, HAU-net achieves 97.79 in Accuracy and 4.27 in HD, both the best values among all models.

TABLE 5. Performance of Different Models Under Non-Open Environment Datasets.

Combining the performance on all three datasets, HAU-net shows excellent performance. This indicates that in tongue crack segmentation scenarios with background interference and poor foreground-background separation, the HPAM-improved HAU-net can overcome these problems and maintain good generalization performance.

4) Images of Model Prediction Results

Figure 9 illustrates the effectiveness of the proposed model compared with various other models in the context of tongue crack extraction. To ensure a fair comparison, we carefully selected test set images within each category for evaluating segmentation outcomes. Specifically, subsets ‘a’ and ‘b’ were chosen from non-open environment datasets, subsets ‘c’ and ‘d’ from open environment datasets, and subset ‘e’ from external data sources [27].

FIGURE 9. Images of different models’ prediction results.

As depicted in the figure, the SegNet model tends to produce more cluttered artifacts in the original image during prediction. U-net++ exhibits omissions in images ‘a’ and ‘d’, while FRCnet displays omissions in image ‘d’ as well.

In Figure 9, column ‘c’, it is evident that each model exhibits varying degrees of omissions, primarily due to the intricate and challenging nature of the tongue fissures in image ‘c’. However, HAU-net demonstrates superior generalization capabilities, with its segmentation results in column ‘c’ closely aligning with the original labels.

In the case of image ‘e,’ Deeplabv3 and HAU-net exhibit superior performance, while other models introduce additional noise in their predictions. The performance in image ‘e’ serves as an indicator of the models’ generalization capabilities.

The combined results from subsets ‘a’ to ‘e’ demonstrate that the proposed model consistently excels in tongue crack extraction, reflecting its strong overall performance.

F. Inference Time and Parameter Count Results

In addition to accuracy, the size and detection speed of the model are also of significant importance, especially for the task of tongue fissure extraction. To assess whether the algorithm’s detection speed can achieve real-time detection, we tested the average detection speeds of different models on the test set; the results are detailed in Table 6. In terms of inference time, HAU-net, U-net_resnet50, U-net_vgg16, DeeplabV3, and Segnet demonstrate similar speeds. However, in terms of model size, both DeeplabV3 and Segnet have higher parameter counts than HAU-net, with U-net_resnet50 having an even higher parameter count of 43.93M. In contrast, FRCnet stands out with the fastest inference and the smallest parameter count, but this comes at the cost of a significant loss in accuracy, making it less suitable for the task of tongue fissure extraction.

TABLE 6. Inference Time and Parameter Count Results.
TABLE 7. Results of Ablation Experiments.

The HAU-net proposed in this paper maintains a stable inference speed while only moderately increasing the parameter count, and exhibits superior accuracy. In practical fissure extraction applications, HAU-net performs accurate extraction efficiently.

G. Ablation Experiments

To verify whether the benefit of adding HPAM to U-net comes from the performance of a single attention module or from the combination of multiple modules, this paper validates the effectiveness of multiple hybrid attention mechanisms in series and in parallel. The following comparative architectures are designed:

  1. Add CAM to the U-net decoder without adding SENet and SAM, noted as CAM in Table 7.

  2. Add SAM to the U-net decoder without adding CAM and SENet, noted as SAM in Table 7.

  3. Add SENet to the U-net decoder without adding CAM and SAM, noted as SE in Table 7.

  4. Add SAM and CAM to the U-net decoder without adding SENet, noted as SAM_CAM in Table 7.

  5. Add SENet and CAM to the U-net decoder without adding SAM, noted as SE_CAM in Table 7.

  6. Add SENet and SAM to the U-net decoder without adding CAM, noted as SE_SAM in Table 7.

  7. Add SENet, SAM, and CAM to the U-net decoder in series, noted as SE_SA_CA(S) in Table 7.

  8. Add SENet, SAM, and CAM to the U-net decoder in parallel, noted as SE_SA_CA(P) in Table 7.

  9. Add SENet, SAM, and CAM in parallel to the encoder of U-net, noted as SE_SA_CA(P)_EN in Table 7.

The results of the ablation experiments are shown in Table 7.

From the ablation experiments presented in Table 7, the following conclusions can be drawn:

  1. When using individual attention mechanisms alone, the SENet performs the best, achieving a MIoU of 76.43% on the test set. This is likely because SENet learns channel dependencies through the introduction of the SE block, which adaptively adjusts the weights of each channel in the feature map. This enables the model to focus on the most relevant features and enhances its generalization ability.

  2. When combining attention mechanisms, the combination of SE and SAM (SE+SAM) yields the highest Dice metric of 0.814. This is possibly due to SAM learning pixel relationships to adjust the weights of each pixel in the tongue crack image, complementing the attention mechanism of the SE channel. The combination of these two mechanisms produces synergistic effects, resulting in improved performance.

  3. When using all three attention networks simultaneously, the MIoU of the SE+SAM+CAM model with parallel attention mechanisms (P) across three branches surpasses the MIoU of the three attention mechanisms in series (S), reaching 76.92%. This is the highest performance among all the compared models. The parallel structure allows for better performance as each attention mechanism addresses different aspects. SENet learns channel dependencies, SAM focuses on spatial attention, and CAM emphasizes channel attention. In contrast, using these three mechanisms in series may lead to incomplete information or interference, reducing network performance. Thus, the ablation experimental results indicate that the parallel structure is a preferable choice for achieving better results in the tongue crack extraction task.

Furthermore, to visually explore the impact of HPAM on the model’s crack extraction ability, weights from the last layer of each ablation model were extracted using the Grad-CAM [22] technique to generate heatmaps. As shown in Figure 10, tongue crack images in the open environment (a) and the non-open environment (b) are presented. In the SE_SA_CA(P) model with HPAM added, the weight of the tongue crack region is significantly higher than in the other comparison models. Heatmaps of the SE and SE_SAM variants reveal lower weights (lighter color) in the marked yellow box region, indicating that these models fail to extract the tongue crack correctly in this region, resulting in reduced accuracy and missed detections. In contrast, in the SE_SA_CA(P) model, the weight at the edges of the tongue cracks increases, represented by a darker red color, indicating successful extraction of the tongue cracks even in these challenging regions.

FIGURE 10. Heat map of attention for ablation experiments.
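For reference, a minimal hook-based Grad-CAM sketch of the kind used to produce such heatmaps is shown below; the implementation details and the choice of target layer are assumptions rather than the authors’ exact code.

```python
# Minimal Grad-CAM sketch: weight the target layer's activations by the
# spatially averaged gradients of the crack-class score, then ReLU.
import torch

def grad_cam(model, x, target_layer, class_idx=1):
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    out = model(x)                                   # (B, n_classes, H, W) logits
    out[:, class_idx].sum().backward()               # gradient w.r.t. crack-class score
    h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)    # channel weights: GAP of gradients
    cam = torch.relu((w * feats["a"]).sum(dim=1))    # weighted activation sum + ReLU
    return cam / (cam.amax() + 1e-8)                 # normalize heatmap to [0, 1]
```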

SECTION IV.

Discussion

The main goal of this study is to design and test a U-net model embedded with HPAM to improve the accuracy of tongue crack extraction. Our experimental results strongly demonstrate the superiority of this structure in handling the tongue crack extraction task.

In comparison experiments, our HAU-net model significantly outperforms the original U-net, U-net++, Deeplabv3, Segnet, and FRCnet models in key performance metrics such as recall, precision, MIoU, and Dice. These results validate the superior performance of our model in identifying and segmenting tongue cracks, which may be attributed to its multi-track parallel attention mechanism extracting and exploiting richer and more complex features. To further validate the generalization performance of the HAU-net model, we used the images and experimental results from Meng-Yi Li’s paper [11] as a comparison, as shown in Figure 11, where the method proposed by Meng-Yi Li shows significant missed detections in both (a) and (c), marked by yellow boxes. Although we use Meng-Yi Li’s image annotations, the crack extraction results of our method in Figure 11(c) fit the original image more closely and are even more accurate than the annotated image. In comparison, the HAU-net method delivers a significant improvement in the tongue crack extraction task.

FIGURE 11. Tongue crack extraction performance comparison chart.

For practical applications and future research, this improved model may have far-reaching implications. First, due to its excellent performance in tongue fissure extraction, this model has the potential to further improve the accuracy and efficiency of objectified diagnosis in TCM tongue diagnosis. Second, by showing that more complex attentional mechanisms can effectively improve model performance, our study may encourage more researchers to incorporate and explore this novel structure in their models.

Although our model achieves good results in extracting tongue cracks, we believe there is still room for further improvement. First, we hope to further improve the performance of the model by optimizing and tuning the multi-track parallel attention mechanism. Second, we also hope to apply this model to other image segmentation tasks in the future to explore its possibilities in a wider range of domains. We also plan to collect and build a larger dataset of tongue cracks to fully validate the robustness and generalization ability of our model.

SECTION V.

Conclusion and Future Work

  1. To develop a suitable model for tongue crack extraction, this study proposes the HAU-net model by adding the HPAM module to the decoder network of the U-net. Compared with the other six tested models, HAU-net showed improvements of varying degrees. HAU-net achieved the highest MIoU of 76.92%, recall of 88.71%, accuracy of 97.76%, and Dice of 0.810 on the overall dataset; compared with the original U-net model, the MIoU improved by 1.73%, MIoU and Dice were significantly improved, and the number of model parameters and the inference rate did not change significantly.

  2. The results of the ablation experiments show that adding HPAM to the U-net decoder yields the most obvious enhancement of the model, and that the parallel HPAM structure is more effective than the serial one.

  3. This study not only provides an effective method for the automatic extraction of tongue cracks but also contributes to the automation and accuracy of tongue diagnosis. This may help improve and optimize the process of TCM diagnosis, in which tongue diagnosis is an important component.

  4. In our future work, we will explore and assess the application of various advanced attention mechanisms to enhance model performance. This may include multi-head attention, cross-modal attention, and more. We will delve into these techniques and endeavor to integrate them into our model to improve its performance in tongue fissure extraction tasks. Additionally, we will continue to seek out additional tongue fissure datasets and explore various data augmentation and enhancement techniques to further enhance the model’s robustness and generalization capabilities.
