Introduction
Traditional Chinese Medicine (TCM) has a history spanning thousands of years, encompassing diagnostic methods like tongue diagnosis, face diagnosis, hand diagnosis, and pulse diagnosis, forming a comprehensive diagnostic system [1], [2], [3], [4]. Among these methods, tongue diagnosis is one of the most mature and widely used in clinical practice [5], [6]. In TCM theory, characteristics such as the shape and color of the tongue reflect the body’s internal health status, including organ functions, blood circulation, and internal balance of cold and heat. Through careful observation of the tongue’s appearance, TCM practitioners can identify various health conditions and select appropriate treatments [7]. However, traditional tongue diagnosis relies on subjective judgments by TCM practitioners, leading to potential variations in diagnoses. To enhance the objectivity and consistency, many tongue diagnosis models have been developed to assist practitioners in more accurate tongue image diagnosis [8], [9], [10], [11].
Tongue diagnosis models are trained and analyzed using tongue images acquired from diagnostic image collection devices. These devices typically capture images directly from the patient’s tongue, often including environmental factors in the images. Using raw images for training can result in these complex environmental factors interfering with the training process, reducing the model’s prediction accuracy [12]. Therefore, separating the tongue portion from the surrounding environment in tongue images is crucial to improving the efficiency of feature learning by deep learning models.
Research on tongue image segmentation has been ongoing for a long time. Before the maturity of deep learning technology, researchers employed various traditional non-deep-learning techniques for tongue image segmentation, including color thresholding [13], edge detection [14], active contour models (ACMs) [15], and region growing and merging [16]. The active contour model is particularly prevalent; for instance, Guo et al. used a two-stage K-means clustering method based on initial tongue boundaries, followed by an ACM, for tongue image segmentation [17]. However, due to inherent limitations, traditional methods require stringent operating conditions, leading to significant shortcomings in their generalizability.
With the continuous advancement of deep learning technology, various deep learning models have been applied to the field of tongue image segmentation, such as U-Net [18], Mask R-CNN [19], and the DeepLab series [20]. Researchers have also developed enhanced tongue segmentation models based on these models. For example, Li et al. proposed a lightweight encoder-decoder architecture for rapid and accurate tongue image segmentation, addressing class imbalance and achieving high IoU on newly constructed datasets [21]. Huang et al. combined U-Net with residual soft connection modules and salient image fusion modules to create OET-NET, a deep neural network tongue image segmentation system suitable for mobile devices, demonstrating superior performance in open environments [22]. Cai et al. introduced TSRNet, a new tongue image segmentation network based on an encoder-decoder framework incorporating global and local refinement modules, improving segmentation accuracy through end-to-end training and validating its effectiveness on real tongue image datasets [23]. Compared to traditional tongue image segmentation techniques, deep learning models excel in extracting detailed local features, making the learning process more straightforward and efficient. These characteristics endow deep learning models with higher generalizability and significantly improve segmentation accuracy.
Despite the generally high accuracy of deep learning models in tongue image segmentation tasks, unresolved challenges remain. Most existing tongue image segmentation models are based on convolutional neural networks, and researchers often increase the number of convolution kernels to enhance image feature learning. However, increasing model complexity in this way may reduce the model's generalization ability and greatly degrade its performance on zero-shot tasks. Additionally, the lack of a large, publicly available database limits the training of tongue image segmentation models. Researchers often rely on self-collected tongue images for model training, a laborious and time-consuming process that results in relatively small training and validation datasets, which undoubtedly weakens the accuracy of segmentation models on zero-shot segmentation tasks. The emergence of the Segment Anything Model (SAM) [24] offers a promising solution to these problems. Thanks to the multi-head attention mechanism of Transformers [25], SAM excels in zero-shot segmentation tasks on unseen datasets, showcasing its robust generalizability. However, SAM's image encoder, based on the basic Vision Transformer (ViT) [26], has a large number of parameters, resulting in long runtimes.
To address these issues, we proposed an improved SAM-based model, named Tongue-LiteSAM. This model modified the image encoder of the SAM baseline by replacing it with two lightweight ViT-Tiny [27] models, significantly reducing the model's parameter count and runtime. Additionally, we applied data perturbation during the bounding box training process as a form of data augmentation, which further improves the model's performance on zero-shot tasks. Moreover, we trained and evaluated the model using six diverse datasets, whose variability allows a more thorough evaluation of the model on zero-shot tasks.
Main Contributions:
Tongue-LiteSAM Model Proposal: We proposed the Tongue-LiteSAM model, which improved the image encoder part of the original SAM model by replacing it with two lightweight ViT-Tiny models, reducing the overall parameter count and improving model efficiency.
Generalizability Enhancement: To enhance the model’s generalizability and ensure robust performance across different data sources, we incorporated data perturbation during the bounding box training process and trained and evaluated the model using six diverse datasets. These datasets covered various features, backgrounds, and scenarios, allowing for a comprehensive evaluation of the model’s generalizability, ensuring its reliability in practical applications.
Comparative Analysis: We compared the Tongue-LiteSAM model with several popular tongue image segmentation methods. Experimental results showed that the Tongue-LiteSAM model excels in tongue image segmentation tasks, achieving higher segmentation accuracy while significantly reducing the model’s parameter count compared to the baseline SAM model, improving efficiency and practicality.
This paper is organized into seven sections. The remaining content covers six sections: Section II reviews the related technical background, Section III details the specific strategies for model improvement, Section IV introduces the experimental data and configurations used, Section V analyzes and discusses the experimental results, and Sections VI and VII summarize the research findings, conclusions, and future research prospects.
Related Work
A. Segment Anything Model
Segment Anything Model is a semantic segmentation model based on Vision Transformers, released by Meta AI in 2023. It comprises three main components: Image Encoder, Prompt Encoder, and Mask Decoder [24], as shown in Figure 1. The Image Encoder uses a Vision Transformer pre-trained with MAE [28] to encode high-resolution images. The Prompt Encoder processes prompts differently depending on their form: points and boxes are represented using position encoding [29]; texts are processed using CLIP’s text Transformer [30]; and masks are encoded through convolution kernels. The Mask Decoder effectively combines image embeddings, prompt embeddings, and output labels to generate the final image segmentation mask, yielding clear segmentation results. SAM was trained on a large dataset containing 11 million images and 1 billion masks, excelling in zero-shot segmentation tasks on previously unseen datasets, demonstrating its strong generalizability.
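For illustration, the following minimal Python sketch shows how SAM is typically prompted with a bounding box using the publicly released segment-anything package; the image path, checkpoint file, and box coordinates below are placeholders rather than values from this study.

import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM model (checkpoint path is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read an RGB image and compute its embedding once with the Image Encoder.
image = cv2.cvtColor(cv2.imread("tongue.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt the Mask Decoder with a bounding box in (x_min, y_min, x_max, y_max) format.
box = np.array([120, 80, 520, 460])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its predicted quality score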
B. Segment Anything Model in Medical Imaging
Currently, SAM has been applied to medical imaging processing. MedSAM [31] is an optimized model for general medical image segmentation tasks. It achieves exceptional performance in various medical image segmentation tasks, including brain hemorrhage CT, glioma MRI, and lung X-rays, by fine-tuning SAM on a multimodal dataset containing over one million medical images and masks. PolypSAM [32] is a SAM variant fine-tuned for polyp segmentation tasks, achieving over 88% Dice coefficient on multiple public datasets, demonstrating excellent segmentation capabilities. UR-SAM [33] enhances the reliability of automatic medical image prompt segmentation by evaluating and utilizing the uncertainty of segmentation results. SegVol [34] is an interactive volumetric medical image segmentation model trained on large-scale unannotated and small annotated CT data, supporting segmentation of over 200 anatomical structures and outperforming existing state-of-the-art methods in multiple benchmark tests. TongueSAM [35] is a tongue segmentation model based on SAM. This model uses the zero-shot generalization capability of SAM and an object detection cue generator based on an existing object detection model to achieve automatic tongue segmentation and demonstrates excellent performance in zero-shot scenarios.
C. Lightweight Handling of Segment Anything Model
Although the basic SAM model has achieved considerable success, there is still room for improvement. Many lightweighting methods have been proposed [36], [37], including several tailored to SAM. Zhang et al. [38] proposed an acceleration method for SAM based on knowledge distillation, replacing the original encoder with a lightweight ViT-Tiny image encoder. They trained the image encoder separately using decoupled knowledge distillation to reduce computation and improve processing speed while maintaining segmentation quality. Xiong et al. [39] pre-trained lightweight image encoders (ViT-Tiny/ViT-Small) on the ImageNet-1K dataset using their SAMI (SAM-leveraged masked image pre-training) technique and used them to replace the original image encoder in SAM. They then fine-tuned the entire model on the SA-1B dataset, resulting in the EfficientSAM model with significantly reduced parameters (20 times fewer) and improved running speed (20 times faster) while maintaining comparable accuracy (only a 2% drop).
D. Vision Transformer
Vision Transformers (ViTs) have demonstrated outstanding performance in various visual tasks, surpassing traditional convolutional neural networks with their powerful generalization capabilities. To meet the deployment needs of different application scenarios, researchers have developed smaller versions of ViTs, such as ViT-Small and ViT-Tiny [27], to replace the original ViT-Huge, ViT-Large, and ViT-Base. These smaller models aim to improve efficiency by reducing parameter count and computation costs while retaining the advantages of ViTs. MobileViT [40] is a typical example, combining the strengths of ViTs with convolution operations to enhance generalization capability while being more efficient in memory and computation costs compared to lightweight CNN models. This hybrid strategy has been widely adopted in subsequent studies, such as LeViT [41], Next-ViT [42], Castling-ViT [43], and EfficientViT [44].
E. Knowledge Distillation
Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large, complex teacher model to a small, efficient student model. The goal of this technique is to reduce the model’s size and computational cost while maintaining comparable performance, making the model more suitable for deployment in resource-constrained environments. Hinton et al. first introduced the concept of knowledge distillation in 2015 [45]. Romero et al. expanded the concept by allowing the training of a deeper and thinner student network using the intermediate representations learned by the teacher as hints to improve the student’s training process and final performance, thereby achieving better generalization or faster running speeds [46]. Zhao et al. proposed a decoupled KD method, achieving better results and training efficiency on the CIFAR-100, ImageNet, and MS-COCO datasets compared to complex feature methods [47].
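As a conceptual illustration of these ideas (a minimal PyTorch sketch, not the exact losses used in the cited works), the soft-target loss of Hinton et al. and the hint-based feature-matching loss of Romero et al. can be written as:

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: soft targets at temperature T plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def hint_loss(student_feat, teacher_feat):
    """Romero-style hint learning: match intermediate feature maps with MSE."""
    return F.mse_loss(student_feat, teacher_feat)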
Methods
A. Tongue-LiteSAM: Improved Model Structure
Given the long training and prediction process caused by the high number of parameters (over 632M) of ViT-H as the image encoder in the original SAM model, we adopted a divide-and-conquer strategy, focusing on optimizing the image encoder part while keeping the mask decoder unchanged. To this end, we designed an innovative image encoder structure to improve the generalization performance of the model and accelerate the inference speed by reducing the number of parameters.
As shown in Figure 2A, our new image encoder structure consists of two independent ViT-Tiny encoders, named image encoder_1 and image encoder_2. Image encoder_1 follows the design concept of the MobileSAM model: it is obtained by distilling the original ViT image encoder of SAM, achieving a significant reduction in parameter count while maintaining model performance. Image encoder_2, on the other hand, is based on the EfficientSAM model and uses MAE-style masked image pre-training leveraged by SAM (SAMI) to obtain high-quality pre-trained weights.
A: Tongue-LiteSAM Model Structure. B: Training Principle of Image Encoder_1. Distillation uses Decoupled Distillation, which distills knowledge directly from the encoder part of the original Vision Transformer to a smaller encoder, without relying on the original Mask Decoder part. This method simplifies the distillation process, avoids the coupling optimization issues in traditional distillation, and can use a simpler mean squared error (MSE) loss function. The Prompt Guided Mask Decoder is a combination of the Prompt, Prompt Encoder, and Mask Decoder from part A. C: Training Principle of Image Encoder_2. During pre-training, the Cross Attention Decoder uses masked image tokens as queries, combining the unmasked and masked features output by the encoder as keys and values to reconstruct the masked tokens. This helps ViT-Tiny learn similar feature representation capabilities as the ViT Base Image Encoder, thereby improving its performance in visual representation and segmentation tasks.
When the image is input to these encoders, they process the image independently and generate feature representations. Finally, we merge the outputs of the two encoders to form the final image features. This dual encoder design strategy has the following theoretical basis and advantages:
Model diversity: We chose to use two ViT-Tiny encoders to explore model diversity, a concept that has been widely studied in deep learning. By fusing different perspectives from multiple models, we expect to capture richer and more diverse features. In theory, this multi-model combination helps reduce the risk of overfitting that a single model may encounter because it can model the input data from multiple perspectives.
Feature complementarity: We expect that the two ViT-Tiny encoders can focus on extracting different types of features. For example, one encoder may be better at capturing the global (macro) features of tongue images, while the other is better at fine-grained local features. This division of labor aims to enhance the comprehensiveness and richness of the feature representation. According to this reasoning, the combination of different features can provide a more comprehensive perspective, thereby improving the overall performance of the model.
Ensemble learning: Ensemble learning methods have been shown to significantly improve model performance in machine learning. This method improves accuracy and robustness by combining the prediction results of multiple models. In our design, the dual encoders act like multiple base learners in ensemble learning. We expect to achieve better performance than a single encoder by integrating the outputs of the two encoders.
Parameter reduction: Since both encoders use lightweight ViT-Tiny backbones, the overall model parameter count is significantly reduced, which not only speeds up inference but also reduces the resource requirements of the model.
Through the above theoretical analysis and design considerations, we believe that the dual ViT-Tiny encoder structure is theoretically superior to a single-encoder or other configuration, bringing better generalization performance and inference efficiency to our model; a sketch of the dual-encoder fusion is given below.
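The following minimal PyTorch sketch illustrates the dual-encoder design; the two encoder modules stand in for the distilled and SAMI-pre-trained ViT-Tiny backbones, and fusing by channel concatenation followed by a 1x1 projection is an illustrative choice rather than the only possible merge operation.

import torch
import torch.nn as nn

class DualTinyEncoder(nn.Module):
    """Fuse the embeddings of two lightweight image encoders into one feature map."""

    def __init__(self, encoder_1: nn.Module, encoder_2: nn.Module, embed_dim: int = 256):
        super().__init__()
        self.encoder_1 = encoder_1   # e.g., ViT-Tiny distilled from SAM's original ViT (MobileSAM-style)
        self.encoder_2 = encoder_2   # e.g., ViT-Tiny pre-trained with SAMI (EfficientSAM-style)
        # Project the concatenated embeddings back to the dimension expected by the mask decoder.
        self.fuse = nn.Conv2d(2 * embed_dim, embed_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.encoder_1(x)   # (B, C, H', W') image embedding from encoder 1
        f2 = self.encoder_2(x)   # (B, C, H', W') image embedding from encoder 2
        return self.fuse(torch.cat([f1, f2], dim=1))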
B. Prompt Encoder Improvement: Bounding Box Training Strategy
The training and prediction process of the SAM model relies heavily on refined prompt engineering. During the training phase, we automatically generated prompting bounding boxes from the annotated mask images. During prediction, we used the Faster R-CNN [48] object detection model to estimate the bounding boxes of the target. However, the object detection model may not guarantee precise localization of the bounding boxes on unseen datasets. Therefore, we adopted a specific bounding box training strategy when training the segmentation model.
Specifically, we first determined a perturbation factor. When generating bounding boxes automatically using mask images, the key data we obtained include: the top-left corner point (x_min, y_min) and the bottom-right corner point (x_max, y_max) of the bounding box, as well as the width W and height H of the image. The formula for perturbing the bounding box is as follows:\begin{align*} & Point_{top-left}(X_{min}\pm W\times Factor, Y_{min}\pm H\times Factor) \tag {1}\\[3pt] & Point_{bottom-right}(X_{max}\pm W\times Factor, Y_{max}\pm H\times Factor) \tag {2}\end{align*}
This method allowed us to introduce some flexibility during the training process, thereby enhancing the model's robustness to slight variations in boundary positions. This step is crucial as it helps the model adapt to errors that may arise from insufficient accuracy in object detection. The perturbed bounding boxes were then input into the model as prompts.
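A minimal NumPy sketch of this strategy is given below; sampling each offset uniformly within ±W×Factor and ±H×Factor is our reading of Eqs. (1)-(2), and clipping to the image bounds is added for safety.

import numpy as np

def bbox_from_mask(mask: np.ndarray):
    """Derive a tight bounding box (x_min, y_min, x_max, y_max) from a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

def perturb_bbox(bbox, img_w, img_h, factor=0.2, rng=None):
    """Jitter each corner by up to factor * image size, as in Eqs. (1)-(2)."""
    if rng is None:
        rng = np.random.default_rng()
    x_min, y_min, x_max, y_max = bbox
    dx, dy = img_w * factor, img_h * factor
    x_min = np.clip(x_min + rng.uniform(-dx, dx), 0, img_w - 1)
    y_min = np.clip(y_min + rng.uniform(-dy, dy), 0, img_h - 1)
    x_max = np.clip(x_max + rng.uniform(-dx, dx), 0, img_w - 1)
    y_max = np.clip(y_max + rng.uniform(-dy, dy), 0, img_h - 1)
    return x_min, y_min, x_max, y_max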
Experiments
A. Datasets
We used six datasets in our experiments. The first dataset was split into a training set and an internal validation set in an 8:2 ratio. The other five datasets were external test sets, used to evaluate the model's performance on zero-shot tasks. Figure 3 shows some sample images from the datasets. The datasets are described in detail as follows:
1) Tongue Image Dataset 1
Collected by the Third Affiliated Hospital of Beijing University of Chinese Medicine during clinical practice. The dataset consists of 642 images, with a resolution of
2) Tongue Image Dataset 2
A public dataset provided by Harbin Institute of Technology. The tongue images were collected using a professional tongue imaging device with constant lighting in a semi-enclosed state, totaling 300 images with a resolution of
3) Tongue Image Dataset 3
A public dataset provided by a GitHub user (without semantic segmentation mask annotations), consisting of 744 images with a resolution of
4) Tongue Image Dataset 4
A public dataset provided by a PaddlePaddle user (without semantic segmentation mask annotations), consisting of 800 images in JPG format with varying resolutions. Our team annotated the tongue images using the Labelme tool, and the annotations were confirmed and refined by medical experts. The final dataset consists of 800 tongue images and 800 corresponding mask images.
5) Tongue Image Dataset 5
A public dataset provided by a PaddlePaddle user (without semantic segmentation mask annotations), consisting of 1,440 images in JPG format with varying resolutions. Our team annotated the tongue images using the Labelme tool, and the annotations were confirmed and refined by medical experts. The final dataset consists of 1,440 tongue images and 1,440 corresponding mask images.
6) Tongue Image Dataset 6
Collected by our team using web scraping techniques. After data cleaning, we obtained 965 tongue images in JPG format with varying resolutions. The images were annotated by professionals using the Labelme tool, and the annotations were confirmed and refined by medical experts. The final dataset consists of 965 tongue images and 965 corresponding mask images.
B. Experimental Evaluation Metrics
1) IoU (Intersection Over Union)
A metric used to measure the overlap between two regions, such as bounding boxes or segmentation masks. It is defined as the ratio of the intersection area to the union area of the two regions. The value ranges from 0 to 1, with higher values indicating a higher degree of overlap between the predicted and true results, i.e., better segmentation accuracy. The formula is:\begin{equation*} IoU(A, B)= \frac {\left |{{ A\cap B }}\right |}{\left |{{ A\cup B }}\right |} \tag {3}\end{equation*}
2) MIoU (Mean Intersection Over Union)
The average IoU value across multiple categories, used to evaluate the overall performance of the image segmentation task. It provides a comprehensive evaluation of the segmentation performance, avoiding potential biases in individual category IoU values. The formula is:\begin{equation*} mIoU(A,B)= \frac {\sum _{i=1}^{N} IoU\left ({{ A^{i},B^{i} }}\right )}{N} \tag {4}\end{equation*}
3) Dice Similarity Coefficient (DSC)
A commonly used metric to measure the overlap or similarity between two images. The DSC ranges from 0 to 1, with values closer to 1 indicating higher similarity of the two sets, i.e., higher overlap between the segmentation result and the ground truth. The formula is:\begin{equation*} Dice(A,B)= \frac {2\left |{{ A\cap B }}\right |}{\left |{{ A }}\right |+\left |{{ B }}\right |} \tag {5}\end{equation*}
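The following NumPy sketch computes these metrics for binary masks; averaging the per-class IoU over the tongue (foreground) and background classes yields the mIoU used here.

import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * inter / total if total > 0 else 1.0

def miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean IoU over the tongue (foreground) and background classes."""
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))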
4) Mann-Whitney U Test
A non-parametric test used to compare the medians of two independent samples to determine if there is a significant difference. It does not assume normal distribution and is suitable for continuous or ordinal data. In this study, it is used to compare whether there are significant differences between the segmentation results of two models or between the results before and after model improvement. In the Mann-Whitney U Test, we usually focus on two metrics: P-value and Effect Size (Cohen’s d).
5) P-Value
An important indicator in statistical hypothesis testing, used to determine the probability of observing the data under the null hypothesis (H0). In the Mann-Whitney U Test, the P-value is used to assess whether the median difference between two samples is statistically significant. If the P-value is less than the preset significance level (usually 0.05), the null hypothesis is rejected, indicating a significant difference in the medians of the two samples. Conversely, if the P-value is greater than or equal to the significance level, the null hypothesis cannot be rejected, indicating no significant difference in the medians.
6) Effect Size (Cohen’s d)
Effect size measures the magnitude of the treatment effect and assesses the practical significance of the experimental results. In the Mann-Whitney U Test, Cohen's d is a commonly used effect size metric, measuring the standardized difference between two samples. The formula for Cohen's d is:\begin{equation*} d = \frac {M_{1} - M_{2}}{S_{pooled}} \tag {6}\end{equation*}where M_{1} and M_{2} are the means of the two samples and S_{pooled} is their pooled standard deviation.
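A minimal SciPy/NumPy sketch of this statistical comparison, applied for example to the per-image IoU values produced by two models, is as follows.

import numpy as np
from scipy.stats import mannwhitneyu

def compare_models(scores_a: np.ndarray, scores_b: np.ndarray):
    """Mann-Whitney U test plus Cohen's d for two sets of per-image scores."""
    stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    n_a, n_b = len(scores_a), len(scores_b)
    # Pooled standard deviation for Cohen's d.
    s_pooled = np.sqrt(((n_a - 1) * scores_a.var(ddof=1) + (n_b - 1) * scores_b.var(ddof=1))
                       / (n_a + n_b - 2))
    d = (scores_a.mean() - scores_b.mean()) / s_pooled
    return p_value, d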
C. Explainability Analysis Method
Grad-CAM [49] is a visualization method for explaining the predictions of deep learning models. It computes the gradient weights associated with a specific category at the last feature layer and generates a heat map that highlights the regions of the input image most critical to the model's decision. This provides an intuitive and effective means of explainability analysis, which helps to improve the transparency and credibility of the model and has practical value for model optimization and error analysis.
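A minimal hook-based PyTorch sketch of the Grad-CAM computation is shown below; the choice of target layer and the use of the summed tongue-class logits as the scalar target are illustrative assumptions rather than the exact configuration used in this study.

import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    """Gradient-weighted class activation map for a segmentation model."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(image)               # (B, num_classes, H, W) segmentation logits
    score = logits[:, class_idx].sum()  # scalar target: summed tongue-class logits
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]          # (B, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)      # channel-wise gradient weights
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)     # normalized heat map in [0, 1]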
D. Experimental Environment
The experimental system environment configuration is shown in Table 1 below.
Results
A. Comparison of Perturbation Factors in Bounding Box Training Strategy
The selection of the perturbation factor is crucial when training with the bounding box training strategy. To determine the optimal perturbation factor, this study trained Tongue-LiteSAM (Ours) on the first dataset with perturbation factors of 0, 0.1, 0.2, 0.3, and 0.4, and made predictions on the internal validation set and five external test sets. The results are shown in Table 2 below.
The experimental results show that the model achieved the highest accuracy with a perturbation factor of 0.2 on Dataset 1, and the highest accuracy with perturbation factors of 0.4, 0.3, 0.1, 0.4, and 0.4 on Datasets 2 to 6, respectively. Subsequent experiments will be conducted using the optimal perturbation factor for each dataset.
To further explain the choice of perturbation factors, Table 3 reports the mAP50-95 of a Faster R-CNN trained on dataset 1 and used to predict bounding boxes on the six datasets, together with the optimal perturbation factor determined from the experimental results.
From the above table, we can observe that the mAP50-95 value is not directly linearly related to the choice of perturbation factor. However, we can draw the following observations:
Relationship between mAP value and perturbation factor: On most datasets, datasets with higher mAP values (such as Dataset 4) have smaller optimal perturbation factors (such as 0.1), while datasets with lower mAP values (such as Dataset 2) have larger optimal perturbation factors (such as 0.4). This suggests that the choice of perturbation factor is related to the difficulty of the dataset (the harder the dataset, the harder it is for Faster R-CNN to predict an accurate bounding box).
Theoretical explanation: This phenomenon may arise because, on easier datasets, the predicted boxes are already accurate, so boxes generated during training with a small perturbation factor closely resemble the predicted boxes and a small factor is sufficient; on harder datasets, the predicted boxes deviate more from the ground truth, so a larger perturbation is needed during training to simulate those deviations.
B. Model Comparison
This experiment used a total of seven models, including U-Net, DeepLabV3, Swin Transformer [50], MobileSAM, SAM, TongueSAM, and Tongue-LiteSAM (Ours), evaluated on the six datasets, with non-parametric tests conducted between Tongue-LiteSAM and the six comparison models (U-Net, DeepLabV3, Swin Transformer, MobileSAM, SAM, TongueSAM). We designed the experiment in two parts: one for the case with domain-specific knowledge, that is, the supervised learning scenario; the other for the zero-shot learning scenario, in which the model must make predictions without direct prior knowledge. Table 4 shows the experimental results in the scenario with specific knowledge, and Table 5 shows the experimental results in the zero-shot scenario.
The experimental data clearly show that across the six datasets, the Tongue-LiteSAM model outperforms the other six comparison models on multiple performance indicators, and its performance is particularly outstanding on the zero-shot dataset 6. On this dataset, Tongue-LiteSAM's IoU-Tongue and mIoU far exceed those of the five models other than TongueSAM; for example, its IoU-Tongue is 28% higher than that of the Swin Transformer model and 10% higher than that of MobileSAM. On Dataset 1, although Tongue-LiteSAM performs better than the other six models, the difference from TongueSAM is not statistically significant. On the other five datasets, Tongue-LiteSAM is the best-performing model in the evaluation, and this result is statistically significant.
Further model performance comparison revealed Tongue-LiteSAM's significant advantage in tongue image segmentation accuracy. As shown in the prediction results in Figure 4, models like U-Net struggled with precise segmentation of tongue edges, correct segmentation of the tongue root, and avoiding spots within the tongue. Particularly on dataset 6, the five models other than Tongue-LiteSAM and TongueSAM performed poorly in tongue image segmentation, almost failing to segment the tongue correctly. In contrast, our improved Tongue-LiteSAM model accurately segmented the tongue, demonstrating its excellent ability in zero-shot tasks.
Comparison of segmentation effects between Tongue-LiteSAM and the other six models.
The Tongue-LiteSAM model excelled in the precision and detail handling of tongue image segmentation. It was particularly adept at processing the tongue edge area, handling edges smoothly and naturally and effectively avoiding common segmentation errors. Moreover, the Tongue-LiteSAM model accurately identified details such as teeth and lips at the edge of the tongue and correctly classified them as background, thereby improving segmentation accuracy and reliability. All these factors underscore Tongue-LiteSAM's leading position in tongue image segmentation, providing strong technical support for related research and applications thanks to its high precision and strong zero-shot segmentation capabilities.
C. Ablation Study
To verify whether the improvement strategies positively impact the model, the Tongue-LiteSAM (Ours) model was trained on one dataset and tested on all six datasets. For each test dataset, the optimal perturbation factor determined in the bounding box experiment was used. For the non-parametric testing, hypothesis tests were conducted between SAM and SAM+new_encoder, and between SAM+new_encoder and SAM+new_encoder+bounding_box, to validate the effectiveness of each improvement. The results are shown in Table 6 and Figure 5.
According to the experimental results, using the new image encoder improved the IoU_Tongue and mIoU performance of the model on one internal validation set and five external test sets. This indicates that replacing the image encoder significantly enhanced the model’s segmentation accuracy and generalization performance. Especially on datasets 5 and 6, the new encoder improved the model’s IoU_Tongue by approximately 4%, with improvements ranging from 1% to 2% on the other four datasets, all with statistical significance. Additionally, the new encoder reduced the model’s parameter count from 94M to 16M, an impressive 81.2% reduction, significantly alleviating the model’s parameter burden.
Introducing the bounding box training strategy further significantly improved the IoU_Tongue and mIoU on the internal validation set and external test sets. This confirms the substantial enhancement effect of the bounding box training strategy on the model's segmentation accuracy and generalization ability. Particularly on datasets 5 and 6, this strategy increased the model's IoU_Tongue by about 8%, with improvements ranging from 1% to 5% on the other datasets, all statistically significant.
D. Explainability Analysis
The internal operating mechanism of a semantic segmentation model during prediction is often difficult to understand intuitively, like a closed black box. To improve the interpretability of the model, we used Grad-CAM heat-map visualization to intuitively present the internal decision-making process of the trained model. This analysis aims to evaluate whether the adopted network model has effectively learned the expected features. Therefore, we combined DeepLabV3, SAM, TongueSAM, and our model Tongue-LiteSAM with Grad-CAM for prediction. This method not only intuitively reveals the contribution of different image regions to the segmentation results but also shows the focus of the network during training. The Grad-CAM heat-map results are shown in Figure 6.
In Figure 6, Input is the original image to be predicted. The heat map of Tongue-LiteSAM (Ours) is brighter in the tongue area, and there are essentially no blue dots in the other, interfering areas, which is a good result. The heat maps of the other models have many blue dots distributed around the image, and the color of their heat maps over the tongue differs considerably from that of Tongue-LiteSAM. This indicates that the improved model learns the characteristics of the tongue more easily and is less likely to be disturbed by information from other interfering regions, demonstrating that the model improvements indeed guide the network to capture key features during learning.
E. Analysis of Model Limitations and Discussion of Improvement Strategies
In this study, we analyzed in depth the cases with poor segmentation results, focusing specifically on the poorly performing segmentation samples in datasets 3 and 4, as shown in Figure 7. First, it is worth noting that dataset 1 benefited from precise acquisition with a professional tongue image collection device, whose light source was stable and avoided interference from direct sunlight, thus ensuring data consistency and high quality. However, in dataset 3, due to the influence of direct sunlight, the face and tongue surfaces of the subjects were unevenly illuminated, which directly led to an unclear segmentation boundary between the left edge of the tongue and the facial skin. Furthermore, in some samples of dataset 4, the tongue was not clean enough and food residues remained on its surface, which also significantly affected the accuracy of segmentation. Based on the above analysis, we propose the following improvement measures: First, more tongue images collected under different light sources should be included in the training dataset to enhance the model's adaptability to different lighting environments. Second, adding tongue images containing food residues will help the model learn to identify and deal with these interference factors, thereby improving the stability of the segmentation results. Through such data augmentation, we hope to significantly improve the segmentation performance of the model in complex environments, ensuring stronger robustness and generalization capabilities.
Discussion
The emergence of SAM has brought infinite possibilities to image segmentation. The model’s outstanding performance in zero-shot segmentation tasks has provided great inspiration for the widespread use of segmentation models. However, the large number of parameters inherent to the model itself severely limits its practicality. Therefore, reducing the number of parameters while maintaining high accuracy and generalization ability has become an important direction for improving the SAM model. Our proposed dual ViT-Tiny image encoder structure significantly reduces the model parameters while greatly improving accuracy, which is highly beneficial for the widespread use of SAM. Additionally, in the bounding box training section, we introduced data perturbation operations, providing an excellent solution for SAM to achieve outstanding results in more zero-shot segmentation tasks.
Based on the aforementioned experimental results, our model significantly outperforms convolutional neural network-based segmentation models (U-Net, DeepLabV3) and the transformer-based segmentation model (Swin Transformer) in terms of segmentation accuracy, and it also shows considerable improvement compared to existing SAM-based models (SAM, MobileSAM, and TongueSAM). Non-parametric tests and validation results further confirm the reliability of these experimental results. From the sample data presented across the six datasets, there are significant differences between the datasets. Compared to other datasets, dataset 3 has a more complex surrounding environment, and dataset 2 shows greater variations in light brightness. This significantly impacts the performance of segmentation models. While other models showed considerable performance variation across different datasets, our model maintained an accuracy of over 92%, which fully demonstrates its extremely strong zero-shot segmentation capabilities. Furthermore, the results of the ablation experiments clearly indicate that our proposed dual ViT-Tiny image encoder structure and data perturbation operations have a significant positive effect on model performance. These improvements have been validated across multiple datasets, further confirming their effectiveness in improving the model's zero-shot segmentation ability and accuracy. At the same time, the explainability analysis based on Grad-CAM heat maps clearly reveals the advantages of the Tongue-LiteSAM model in feature recognition, illustrating its superior performance in extracting and responding to key features compared to other models.
Although image segmentation models based on ViT have significant advantages over traditional convolutional neural network models in segmentation accuracy and zero-shot segmentation ability, their complex structures often result in longer computation times, severely limiting their application in real-world scenarios. This issue is particularly pronounced for large models like SAM. Although we have improved the SAM model and significantly reduced the time required for segmentation tasks, in some extreme cases (such as excessive lighting or insufficient tongue cleanliness) the model's segmentation performance still degrades considerably. To better meet the needs of complex segmentation tasks in real-world situations, research on and optimization of the segmentation model must continue.
Conclusion
We explored the importance of tongue image segmentation in training tongue diagnosis models and pointed out that the existing convolutional neural network-based tongue segmentation models are inferior to ViT-based models in zero-shot segmentation tasks. Inspired by the SAM model and its improved versions, we improved the image encoder module of the original SAM model and introduced data perturbation operations in the bounding box training section to construct a new tongue image segmentation model, Tongue-LiteSAM. Comparative experiments on six different datasets show that Tongue-LiteSAM exhibits excellent performance in tongue image segmentation tasks. Additionally, ablation experiments and explainability analysis further validate the effectiveness of the proposed model improvement methods. In summary, Tongue-LiteSAM not only improves segmentation accuracy and zero-shot segmentation capability but also provides valuable reference points for further research in the field of tongue image segmentation.
ACKNOWLEDGMENT
(Daiqing Tan and Hao Zang are co-first authors.)