Introduction
Semantic segmentation is a crucial task in computer vision in which a class label is assigned to every pixel of an image. Its applications span diverse domains, including robotics and satellite image analysis.
Despite its significance, current semantic segmentation methods still face several critical challenges. First, they are costly, requiring pixel-level annotation and extensive training. Second, because supervised learning depends on a predefined set of categories, detecting extremely rare or entirely new classes at prediction time is virtually impossible.
Two related tasks were proposed to address these limitations: unsupervised and open-vocabulary semantic segmentation. Unsupervised semantic segmentation [1], [2], [3] avoids the expensive annotation process by using representations obtained through a backbone model [4], [5] trained on a different task. Open-vocabulary semantic segmentation [6], [7], [8], [9], [10] enables the identification of a wide array of categories through natural language and is not bound to a pre-defined set of categories.
However, challenges remain with these methods. Unsupervised semantic segmentation clusters images by class but cannot identify the class of each cluster, while open-vocabulary segmentation assumes that text queries describing objects in the image are provided by the user. To address these challenges, zero-guidance segmentation was introduced in [14], enabling open-vocabulary segmentation without the need to input class candidates (guidance), yet there is still room for improvement in terms of performance. We categorize these related works into four distinct areas in Table 1.
Building on this background, we further improve this approach by introducing a novel method named TAG, which offers higher performance and flexibility. As its primary strength, TAG achieves Training-, Annotation-, and Guidance-free open-vocabulary semantic segmentation. It extracts semantic features from each pixel in an image using CLIP [13] and then retrieves open-vocabulary classes for these features from an external database [19], [20], [21], [22]. TAG operates on frozen pre-trained CLIP [13] and DINOv2 [5] models, eliminating the need for an additional training process. CLIP [13] can identify diverse objects and scenes, but its segmentation results are often coarse and noisy, necessitating refinement. DINOv2 [5], in contrast, excels at capturing fine details and global context, enabling precise segmentation. Combining these models leverages CLIP [13]'s generalization and DINOv2 [5]'s detailed feature extraction for more accurate segmentation. Neither model relies on the dense and costly annotations traditionally required for semantic segmentation.
Furthermore, the extensibility of its database gives the method flexibility, making it easy to adapt to new classes or scenarios. A major distinction between previous methods [14], [16] and our TAG is that TAG is more flexible: it can be extended to new concepts simply by adding them to the database, whereas previous methods require re-training. It is important to note that while the database used in TAG is finite, language models such as BLIP [17] or GPT [15] are likewise built from finite datasets. In [23], retrieval-based methods are even reported to outperform BLIP [17] in the context of image classification.
Our TAG can segment an image into meaningful segments, as shown in Figure 1, without any text guidance. In particular, TAG accurately segments structures and labels them with their proper nouns, such as the Leaning Tower of Pisa and the Colosseum. In addition, TAG shows significant improvements over other comparable segmentation methods, e.g., on the PascalVOC [24] dataset (+15.3 mIoU).
Figure 1. Guidance-free Open-Vocabulary Semantic Segmentation. Our TAG can segment an image into meaningful segments without training, annotation, or guidance. It successfully segments structures such as the Leaning Tower of Pisa and the Colosseum. Unlike traditional open-vocabulary semantic segmentation methods, TAG can segment and categorize without text guidance.
Our contributions are the following:
We propose a novel approach, namely TAG, to achieve open-vocabulary semantic segmentation that does not require pre-defined categories by retrieving segment categories from an external database.
TAG achieves compelling segmentation results for all categories in the wild without any additional training, high-cost dense annotation, or text query guidance.
TAG outperforms the previous state-of-the-art methods by 15.3 mIoU on the PascalVOC [24] dataset, demonstrating the superior segmentation performance of our proposed approach.
Related Work
A. Semantic Segmentation
Semantic segmentation is the task of assigning class labels to all pixels in an image, commonly using convolutional neural networks [11], [25] or vision transformers [26] trained end-to-end. These methods, while effective, depend on extensive annotation and significant computational resources for training, and are limited to predefined categories. Thus, unsupervised and domain-flexible approaches have recently gained importance.
Unsupervised semantic segmentation [1], [2], [3], [27] attempts to solve semantic segmentation without using any kind of supervision. STEGO [2] and HP [3] optimize the head of a segmentation model using image features obtained from a backbone pre-trained with DINO [4] and DINOv2 [5], self-supervised methods whose representations transfer to many tasks. However, unsupervised semantic segmentation clusters images by class but cannot identify the class of each cluster. In contrast, our TAG distinguishes classes without extra training or annotation.
B. Open-Vocabulary Semantic Segmentation
Open-vocabulary semantic segmentation, crucial for segmenting objects across domains without being limited to predefined categories, has seen notable advancements with the introduction of key methodologies [6], [7], [8], [9], [10], [28], [29], [30], [31], [32], [33], [34].
Early attempts, such as ZS3Net [30] and SPNet [32], focused on zero-shot learning, training custom modules to bridge visual and language embedding spaces. These methods set the foundation for future improvements.
This area has seen significant improvement, particularly through the integration of vision-language models like CLIP [13], which train visual and textual feature encoders on extensive image-text pairs. LSeg [33], OpenSeg [7], OPSNet [34], and OVSeg [7] have each contributed to advancements in the field by leveraging CLIP [13]. These methods typically generate class-agnostic masks and then use CLIP [13] to classify each mask, demonstrating the versatility of CLIP [13] embeddings in open-vocabulary semantic segmentation.
Moreover, MaskCLIP [9] and GEM [35] have highlighted the potential of using intermediate representations from a frozen CLIP [13] encoder to segment images directly without additional training, reducing both annotation and training costs. Concurrently, models like ODISE [8] have explored integrating pre-trained diffusion models [12] with CLIP [13] to achieve high-performance panoptic segmentation.
Despite these advancements, a limitation shared by these methods is their reliance on text input as guidance from users. Our TAG tackles this limitation and allows open-vocabulary segmentation without text guidance. Closest and concurrent to our work is the zero-guidance semantic segmentation paradigm [14], in which clustered DINO [4] embeddings are combined with CLIP [13]. To generate captions from CLIP [13] features, ZeroSeg [14] uses ZeroCap [36], which combines a language model, GPT-2 [15], with CLIP [13]. It adjusts parts of GPT-2 [15] to complete a sentence starting with “Image of a...” so that the sentence closely matches the image according to CLIP's understanding.
However, there is still room for improvement in terms of performance. We hypothesize that the issue stems from the performance of ZeroCap [36]. Therefore, our TAG instead estimates categories by retrieving them from an external database.
C. Text Retrieval From CLIP Embedding
In natural language processing, retrieving information from external databases has been shown to boost the performance of large language models [37], [38], [39]. This concept has also been explored in computer vision, particularly for addressing class imbalance by using databases to retrieve training samples or image-text pairs. RAC [40] and VIC [23] achieve image classification without relying on predefined classes by utilizing an external database. This approach has the advantage of low memory consumption because it only stores captions from databases such as Public Multimodal Datasets (PMD) [19], which collects image-text pairs from different public datasets.
Method
Figure 2 shows an overview of our proposed method, which we call TAG. TAG partitions input images into semantic segments and labels each segment with open-vocabulary categories. To this end, we propose to identify segment candidates using per-pixel features obtained from DINOv2 [5] (Sec. III-A), acquire representative segment embeddings for these candidates using per-pixel features from a ViT pre-trained with CLIP [13] (Sec. III-B), and assign categories to each candidate segment by retrieving the closest matching sentence from an external database (Sec. III-C). Note that, unlike traditional open-vocabulary semantic segmentation, the input is only the image; there is no need to provide category candidates as guidance.
Figure 2. High-level overview of our TAG architecture. Our TAG partitions images into semantic segments and labels each segment with open-vocabulary categories. First, TAG identifies segment candidates using per-pixel features obtained from DINOv2 [5]. Then, it acquires representative segment embeddings for the segment candidates using per-pixel features from a ViT pre-trained with CLIP [13]. Finally, categories are assigned to each candidate segment by retrieving the closest matching sentence from an external database. Note that the input is only the image, with no need to input category candidates as guidance.
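To make the pipeline concrete, the following Python sketch outlines the three stages under simplifying assumptions; the helper functions dinov2_features, clip_pixel_features, and retrieve_category are hypothetical placeholders for the components detailed in Secs. III-A to III-C, not part of a released implementation.
\begin{verbatim}
from sklearn.cluster import KMeans

def tag_segment(image, dinov2_features, clip_pixel_features,
                retrieve_category, num_clusters=15):
    """Hypothetical end-to-end sketch of the TAG pipeline (Sec. III)."""
    # 1) Segment candidates: cluster DINOv2 per-pixel features (Sec. III-A).
    dino_feats = dinov2_features(image)              # (H, W, C_dino) NumPy array (assumed)
    H, W, C = dino_feats.shape
    labels = KMeans(n_clusters=num_clusters).fit_predict(
        dino_feats.reshape(-1, C)).reshape(H, W)     # per-pixel cluster index

    # 2) Representative segment embeddings from CLIP per-pixel features (Sec. III-B).
    clip_feats = clip_pixel_features(image)          # (H, W, C_clip) NumPy array (assumed)
    categories = {}
    for k in range(num_clusters):
        seg_emb = clip_feats[labels == k].mean(axis=0)   # Eq. (3)
        # 3) Retrieve an open-vocabulary category from the database (Sec. III-C).
        categories[k] = retrieve_category(seg_emb)       # Eqs. (4)-(5)
    return labels, categories
\end{verbatim}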
A. Segment Candidates With DINO
It has been observed that segmentation results obtained from CLIP-based segmentation methods [9], [35] are fragmented and noisy, as shown in Figure 4. Therefore, the first step in our TAG pipeline is computing segment candidates to achieve more accurate segmentation results. To obtain more precise segmentation outcomes than CLIP-based methods without using dense annotations, we follow unsupervised segmentation methods [2], [3] and employ a ViT pre-trained with DINOv2 [5].
Figure 3. Overview of the flow for each segment. Each segment independently retrieves category candidates and is assigned a category.
The output of DINOv2 [5] is a per-pixel feature map. We cluster these features with k-means, and the resulting clusters form the segment candidate masks used in the following steps.
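As a rough, hedged illustration of this step, the sketch below extracts DINOv2 patch features and clusters them with k-means to form segment candidate masks. It assumes the publicly released dinov2_vitl14 checkpoint on torch.hub and nearest-neighbor upsampling of the patch-level cluster map to pixel resolution; the exact TAG procedure may differ.
\begin{verbatim}
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

@torch.no_grad()
def segment_candidates(image_bchw, dinov2, num_clusters=15):
    """Cluster DINOv2 patch features into segment candidate masks (sketch).

    image_bchw: (1, 3, H, W) normalized image, H and W divisible by 14.
    dinov2:     e.g. torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
    """
    _, _, H, W = image_bchw.shape
    out = dinov2.forward_features(image_bchw)
    feats = out["x_norm_patchtokens"][0]          # (N, C) patch features
    h, w = H // 14, W // 14                       # ViT-L/14 patch grid
    labels = KMeans(n_clusters=num_clusters).fit_predict(feats.cpu().numpy())
    labels = torch.from_numpy(labels).float().reshape(1, 1, h, w)
    # Upsample discrete cluster indices to pixel resolution.
    labels = F.interpolate(labels, size=(H, W), mode="nearest").long()[0, 0]
    # One binary mask per segment candidate: (num_clusters, H, W).
    return torch.stack([(labels == k) for k in range(num_clusters)])
\end{verbatim}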
B. Representative Segment Embeddings With CLIP
CLIP [13] is a ViT model that can embed images and text into the same latent space. To assign natural language categories to each segment, we use CLIP [13] to embed the image at the pixel level.
Instead of directly acquiring pixel-level features from CLIP [13], we extract dense patch-level features from the image encoder of CLIP [13], following CLIP-based segmentation methods [9], [35]. The image encoder of CLIP [13] uses a multi-head attention layer in which the globally average-pooled feature serves as the query and the feature at each patch generates a key-value pair. This layer then outputs a spatially weighted sum of the incoming feature map followed by a linear layer $F$: \begin{align*} \text {AttnPool}(\overline {q}, k, v) &= F\left(\sum _{i} \text {softmax}\left(\frac {\overline {q} k_{i}^{T}}{C}\right) v_{i}\right) \\ &= \sum _{i} \text {softmax}\left(\frac {\overline {q} k_{i}^{T}}{C}\right) F(v_{i}), \tag {1}\\ \overline {q} &= \text {Emb}_{q}(\overline {x}), \quad k_{i} = \text {Emb}_{k}(x_{i}), \quad v_{i} = \text {Emb}_{v}(x_{i}), \tag {2}\end{align*} where $\overline{x}$ denotes the globally average-pooled feature, $x_{i}$ the feature of patch $i$, and $C$ a scaling constant.
Based on this observation, we utilize the features from the last attention layer of CLIP [13] image encoder by adopting the GEM [35] mechanism.
The CLIP model in TAG thus outputs value features for every patch, which are projected into the joint image-text space and upsampled to per-pixel features $\mathbf{f}_{hw}$ used in the next step.
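The core of Eq. (1) is that the per-patch value features $v_i$ can be pushed through the same output projection $F$ (and CLIP's final projection into the joint image-text space) as the pooled feature, yielding dense patch embeddings that are directly comparable with text embeddings. The sketch below illustrates this idea with hypothetical module attributes (v_proj, out_proj, visual_projection); it is a simplification of the GEM [35] mechanism, not its exact implementation.
\begin{verbatim}
import torch

@torch.no_grad()
def dense_clip_features(patch_tokens, v_proj, out_proj, visual_projection):
    """Sketch of dense features from CLIP's last attention layer (Eq. (1)).

    patch_tokens:      (N, C) features entering the final attention layer.
    v_proj:            value embedding Emb_v from Eq. (2) (assumed module name).
    out_proj:          the linear layer F applied after attention pooling.
    visual_projection: CLIP's projection into the joint image-text space.
    """
    v = v_proj(patch_tokens)            # v_i = Emb_v(x_i)
    dense = out_proj(v)                 # F(v_i): per-patch features
    dense = dense @ visual_projection   # align with CLIP text embeddings
    return dense / dense.norm(dim=-1, keepdim=True)
\end{verbatim}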
Next, to assign categories to the segment candidates, we calculate an embedding representing each segment from the CLIP [13] per-pixel features $\mathbf{f}_{hw}$ by averaging them over the segment candidate mask $m_{k}$: \begin{equation*} \bar {\mathbf {f}}_{k} =\frac {1}{M_{k}} \sum _{h,w} m_{khw} \cdot \mathbf {f}_{hw}, \quad M_{k}=\sum _{h,w} m_{khw}, \tag {3}\end{equation*} where $m_{khw} \in \{0, 1\}$ indicates whether pixel $(h, w)$ belongs to segment $k$ and $M_{k}$ is the segment's pixel count.
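A minimal sketch of Eq. (3): given per-pixel CLIP features and binary segment masks, the representative embedding of each segment is the mask-averaged mean of its pixel features. The tensor layout is our assumption.
\begin{verbatim}
import torch

def segment_embeddings(pixel_feats, masks):
    """Eq. (3): mask-averaged representative segment embeddings.

    pixel_feats: (H, W, C) per-pixel CLIP features f_hw.
    masks:       (K, H, W) binary segment candidate masks m_khw.
    returns:     (K, C) representative embeddings, one per segment.
    """
    K = masks.shape[0]
    flat_feats = pixel_feats.reshape(-1, pixel_feats.shape[-1])  # (H*W, C)
    flat_masks = masks.reshape(K, -1).float()                    # (K, H*W)
    sums = flat_masks @ flat_feats                               # masked sums
    counts = flat_masks.sum(dim=1, keepdim=True).clamp(min=1.0)  # M_k
    return sums / counts
\end{verbatim}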
C. Segment Category Retrieval
CLIP [13] can embed images and text in the same latent space, but the model itself cannot generate images or text from the embedded features. To address this challenge, our proposed method TAG finds the closest category using multi-modal data from large databases.
First, we retrieve a few of the most probable candidate classes from the large classification space.
Let $D$ be the database of image captions. Given a representative segment embedding $\bar{\mathbf{f}}_{k}$, we retrieve the top-$n$ captions whose CLIP [13] text embeddings have the highest cosine similarity to the segment embedding: \begin{equation*} D_{\bar {\mathbf {f}}_{k}} = \underset {\mathbf {d} \in D}{\text {top-}n} \: \frac {\bar {\mathbf {f}}_{k}^{T} \cdot \mathbf {f}_{d}}{\| \bar {\mathbf {f}}_{k} \| \cdot \|\mathbf {f}_{d} \|}, \quad \mathbf {f}_{d} = \text {CLIP}_{t}(d), \tag {4}\end{equation*} where $\text{CLIP}_{t}$ denotes the CLIP [13] text encoder.
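The retrieval in Eq. (4) can be sketched as follows, assuming the CLIP text embeddings of all database captions have been pre-computed; an exhaustive matrix product is used here for clarity, although an approximate nearest-neighbor index would typically be preferable for a database of this size.
\begin{verbatim}
import torch

def retrieve_captions(seg_emb, caption_embs, captions, n=10):
    """Eq. (4): retrieve the top-n database captions for one segment.

    seg_emb:      (C,) representative segment embedding.
    caption_embs: (|D|, C) pre-computed CLIP text embeddings of the captions.
    captions:     list of |D| caption strings (the database D).
    """
    seg = seg_emb / seg_emb.norm()
    caps = caption_embs / caption_embs.norm(dim=-1, keepdim=True)
    sims = caps @ seg                       # cosine similarity to each caption
    top = sims.topk(n).indices
    return [captions[i] for i in top.tolist()]
\end{verbatim}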
Next, to extract candidate words $C_{\bar{\mathbf{f}}_{k}}$ from the retrieved captions $D_{\bar{\mathbf{f}}_{k}}$, we apply three text-processing operations.
In the first operation, we remove all irrelevant words, such as URLs or file extensions.
Secondly, we align words referring to the same semantic class into a standardized format. Specifically, we convert upper case to lower case and plural words to their singular form.
In the final operation, we filter out two kinds of words: rare or noisy words, based on the frequency of word occurrences, and entire categories of words determined by Part-Of-Speech (POS) [41] tagging. Frequency filtering retains only those words that appear more than two times in the input text. If the threshold is set too high and no words meet the criterion, it is lowered to include at least the most frequently occurring words. The POS [41] tagging classifies words into groups such as adjectives, articles, nouns, or verbs, allowing us to exclude any terms that do not hold semantic significance as segmentation categories.
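As a hedged illustration, the three operations could be implemented with off-the-shelf tools as below: regular expressions for removing URLs and file extensions, simple lowercasing with a crude plural-stripping heuristic, and frequency plus POS filtering using NLTK's tagger to keep noun-like words. The concrete cleaning rules, singularization method, and POS tag set kept by TAG are assumptions.
\begin{verbatim}
import re
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def candidate_words(captions, min_count=2, keep_tags=("NN", "NNP")):
    """Sketch of the three filtering operations applied to retrieved captions."""
    # 1) Remove irrelevant tokens such as URLs or file extensions.
    text = " ".join(captions)
    text = re.sub(r"https?://\S+|\S+\.(jpg|jpeg|png|gif)\b", " ", text, flags=re.I)

    # 2) Standardize: lowercase and a crude plural-to-singular heuristic.
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

    # 3) Frequency filtering (relaxed if nothing survives) and POS filtering.
    counts = Counter(tokens)
    threshold = min_count
    while threshold > 1 and not any(c >= threshold for c in counts.values()):
        threshold -= 1
    frequent = [w for w, c in counts.items() if c >= threshold]
    return [w for w, tag in nltk.pos_tag(frequent) if tag in keep_tags]
\end{verbatim}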
Given the candidate words $C_{\bar{\mathbf{f}}_{k}}$, we assign to the segment the word whose CLIP [13] text embedding is closest to the representative segment embedding: \begin{equation*} W = \underset {\mathbf {c} \in C_{\bar {\mathbf {f}}_{k}}}{\text {argmax}} \: \frac {\bar {\mathbf {f}}_{k}^{T} \cdot \mathbf {f}_{c}}{\| \bar {\mathbf {f}}_{k} \| \cdot \|\mathbf {f}_{c} \|}, \quad \mathbf {f}_{c} = \text {CLIP}_{t}(c), \tag {5}\end{equation*} where $W$ is the category assigned to segment $k$.
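Eq. (5) then reduces to an argmax over the candidate words, re-embedded with CLIP's text encoder. The sketch below assumes an open_clip-style interface (a tokenizer plus encode_text), which is one possible way to realize $\text{CLIP}_{t}$ and is not necessarily the authors' setup.
\begin{verbatim}
import torch

@torch.no_grad()
def assign_category(seg_emb, candidate_words, clip_model, tokenizer):
    """Eq. (5): pick the candidate word closest to the segment embedding."""
    tokens = tokenizer(candidate_words)        # one text entry per candidate word
    word_embs = clip_model.encode_text(tokens) # f_c = CLIP_t(c)
    word_embs = word_embs / word_embs.norm(dim=-1, keepdim=True)
    seg = seg_emb / seg_emb.norm()
    sims = word_embs @ seg                     # cosine similarity per candidate
    return candidate_words[int(sims.argmax())]
\end{verbatim}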
Experiment
First, we present the implementation details in Section IV-A. Next, we compare our results to previous methods in Section IV-B and evaluate the open vocabulary aspect in Section IV-C. Finally, we justify the construction of TAG through an ablation study in Section IV-D.
A. Implementation Details
For our implementation of TAG, we employ frozen pre-trained CLIP [13] and DINOv2 [5] models with the ViT-L/14 architecture.
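For reference, one way to obtain the two frozen backbones from public checkpoints is sketched below, assuming the open_clip distribution of CLIP ViT-L/14 and the torch.hub release of DINOv2 ViT-L/14; this is a plausible setup rather than the authors' exact configuration.
\begin{verbatim}
import torch
import open_clip

# Frozen CLIP ViT-L/14 (OpenAI weights) via the open_clip package (assumed).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
clip_model.eval().requires_grad_(False)

# Frozen DINOv2 ViT-L/14 via torch.hub (assumed).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().requires_grad_(False)
\end{verbatim}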
B. Main Results
To validate the performance of TAG, we conducted comprehensive comparative experiments with its closest counterpart, ZeroSeg [14]. For these experiments, TAG uses the PMD [19] database. We set the number of k-means clusters to 15 and the frequency filtering threshold to 2. For the evaluation, we used the mean Intersection over Union (mIoU) as the primary metric. Because TAG produces free-form text, the predicted text $T_{i}$ of each segment is reassigned to the ground-truth class with the highest Sentence-BERT (SBERT) similarity: \begin{equation*} T_{i}^{*} = \underset {t \in T^{gt}}{\text {argmax}} \: [\text {cossim}^{\text {SBERT}}(T_{i}, t)]. \tag {6}\end{equation*}
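The label reassignment of Eq. (6) can be sketched with the sentence-transformers library as follows; the specific SBERT checkpoint used here (all-MiniLM-L6-v2) is our assumption, as the evaluation only specifies SBERT.
\begin{verbatim}
from sentence_transformers import SentenceTransformer, util

# Assumption: a standard SBERT checkpoint; the protocol only specifies "SBERT".
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def reassign_labels(predicted_texts, gt_classes):
    """Eq. (6): map each predicted category to its most similar ground-truth class."""
    pred_embs = sbert.encode(predicted_texts, convert_to_tensor=True)
    gt_embs = sbert.encode(gt_classes, convert_to_tensor=True)
    sims = util.cos_sim(pred_embs, gt_embs)    # (num_predictions, num_gt_classes)
    best = sims.argmax(dim=1)
    return [gt_classes[i] for i in best.tolist()]
\end{verbatim}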
We perform our experiments on the PascalVOC [24] dataset comprising 20 classes, PascalContext [43] with 59 classes, as well as ADE20K [44] consisting of 150 classes.
The qualitative results are shown in Figure 4 and Figure 5. In Figure 4, we compare TAG with the CLIP [13]-based open-vocabulary methods MaskCLIP [9] and GEM [35]. MaskCLIP [9] and GEM [35] produce noisy and fragmented segmentations, whereas TAG achieves more consistent segments that better correspond with the shape of the object and the segment categories. In Figure 5, we compare GroupViT [42], ZeroSeg [14], and our TAG on images containing general objects from PascalContext [43]. In image (a), TAG is the only method that accurately recognizes the cow as a calf. In addition, TAG assigns the precise and relevant class ‘barn’ to the surroundings of the calf, unlike ZeroSeg, which incorrectly includes the class ‘sleeping’. In image (b), TAG is the only method to correctly identify ‘sunglasses’, and it also accurately classifies the dog as a ‘bulldog’. However, in image (c), TAG does not distinguish between a desk and a chair but rather assigns the rough class ‘room’ to the entire space. Occasionally, as shown in (d), TAG assigns proper nouns such as ‘swindon’ and ‘swanage’, which are names of towns in southern England. While TAG correctly identifies the background as the town ‘swanage’, the ground is incorrectly assigned to the town of ‘swindon’. We hypothesize this is caused by both segments being close in the CLIP [13] embedding space.
The quantitative results are shown in Table 2. TAG improves by +15.3 mIoU on PascalVOC [24], +0.6 mIoU on PascalContext [43], and +0.2 mIoU on ADE20K [44] over the previous zero-guidance segmentation state of the art. In particular, TAG shows a dramatic performance improvement on PascalVOC [24], which was identified as a limitation of ZeroSeg [14]. Additionally, TAG improves substantially over training-free open-vocabulary segmentation methods, achieving an impressive +28.3 mIoU gain on PascalVOC [24] even without text-based guidance.
C. Open Vocabulary Segmentation on Web-Crawled Images
In this section, we thoroughly assess the performance of TAG using open vocabulary segmentation experiments on web-crawled images, where we test the model’s ability to accurately segment various unseen classes, including specific and detailed categories such as ‘joker’ and ‘porsche’.
The qualitative outcomes of the experiments are visually depicted in Figure 6. In this figure, (a) represents a general image, while (b) and (c) showcase images created by Stable Diffusion [12]. Furthermore, (d) and (e) show images containing proper nouns.
Figure 6. Open-vocabulary segmentation results. In (a) we test on a general image, (b) and (c) show images generated by Stable Diffusion [12], and (d) and (e) are images featuring specific proper nouns.
In image (a), although the complex concept of a ‘mirror’ is not captured, the segmentation successfully identifies both ‘cat’ and ‘bathroom’, resulting in accurate outcomes. In image (b), while failing to recognize the ‘astronaut’, the model aptly estimates the ground as the ‘moon’, leading to a logical result. Given TAG’s ability to identify the ground as the moon, it is evident that it understands the whole image while generating segment embeddings. Image (c) showcases the precise segmentation of various foods. In image (d), TAG impressively segments and identifies proper nouns such as ‘joker’ and ‘batman’. Lastly, in image (e), despite the specific proper noun ‘porsche’ applying to the car, it is correctly recognized as a supercar, affirming the accuracy of the segmentation.
These findings serve as compelling evidence that TAG can accurately segment open vocabularies, including complex and specific categories, underscoring its versatility and effectiveness in handling diverse and intricate segmentation tasks.
D. Ablation Study
In this section, we perform ablation studies on TAG, examining how the choice of database, the number of k-means clusters, and the label reassignment used during evaluation affect performance.
Table 3 presents the results of the ablation experiments comparing the effect of the database and the number of k-means clusters on mIoU. The results indicate that PMD [19] and CC12M [20] are the preferable datasets for our database. They also reveal that using PMD [19] as the database and setting the number of k-means clusters to 5 is the most robust choice, consistently yielding favorable outcomes across multiple datasets. Figure 7 shows qualitative results of the ablation. The left image remains unchanged regardless of variations in the database or the number of clusters, whereas increasing the number of clusters can cause segments with the same semantic meaning, such as ‘apartment’, to be divided into different segments like ‘home’ or ‘house’, as in the right image.
Figure 7. Qualitative results of the ablation study on PascalVOC [24]. The database and the number of k-means clusters are shown with the results.
Furthermore, we conduct ablation experiments on the filtering operations and on the number of captions retrieved for segment category retrieval. Table 4 shows that utilizing all three filtering operations yields the best results. Similarly, we examine the effect of the threshold used in frequency filtering; the results in Table 5 indicate that our default threshold of 2 is justified. In addition, Table 6 varies the number of retrieved captions and indicates that our default setting of 10 is appropriate.
For evaluation, the predicted text of each segment is reassigned to the closest ground-truth label via SBERT similarity, as defined in Eq. (6); we also examine the effect of this label reassignment in the ablation.
Limitation
While TAG achieves remarkable results, our proposed method still comes with certain limitations. First, as shown in Table 3, TAG depends on the choice of database, making it challenging to select the optimal database for unknown domains without information on test labels. On the other hand, TAG can flexibly address this limitation by adding new concepts to the database without retraining, unlike language-based methods [14], [16]. Second, TAG does not distinguish between different levels of class granularity. As shown in Figure 6 (e), TAG predicted both ‘Porsche’ and ‘Lamborghini’ as ‘supercar’. While the predicted categories in the qualitative results are consistently correct, they may not always align with the optimal category desired by the user. Future work might address this issue by considering the frequency of words within the database.
Conclusion
In this study, we proposed TAG, a Training-, Annotation-, and Guidance-free open-vocabulary semantic segmentation method. TAG extracts semantic features from each pixel in an image using CLIP [13] and then retrieves open-vocabulary categories for these features from an external database. Through a series of comprehensive experiments and analyses, we have demonstrated the effectiveness and versatility of TAG across various datasets and challenging segmentation tasks. Our results indicate that TAG exhibits robust performance in handling diverse categories, including general classes and fine-grained, proper-noun-based segments.
Overall, our findings highlight the potential of TAG as a powerful and effective tool in the field of semantic segmentation. By retrieving the open-vocabulary categories, we have successfully demonstrated the model’s capability to handle diverse datasets and open vocabularies without text guidance, paving the way for future advancements and applications in this critical area of computer vision.