1. Introduction
The introduction of diffusion models has led to a significant advancement in text-to-image (T2I) generation [7]. Diffusion-based models, such as Stable Diffusion [39] and other contemporary works [22], [27], [30], [32], [37], [38], [41], have been rapidly adopted across the research community and industry, owing to their ability to generate high-quality images that accurately reflect the semantics of text prompts.
This work makes three contributions. (a) We introduce Open-Vocabulary Attention Maps (OVAM), a training-free extension for text-to-image diffusion models that generates text-attribution maps from open-vocabulary descriptions. Our approach overcomes a limitation of existing methods, which are constrained to words contained in the prompt [45], [51], [55], [60]. (b) Our token optimization process improves the accuracy of the generated attention maps, thereby boosting the performance of existing semantic segmentation methods based on diffusion attention maps [19], [45], [54], [60]. (c) Finally, we validate the utility of OVAM for producing synthetic images with precise pixel-level annotations.
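To make the notion of a text-attribution map concrete, the following minimal PyTorch sketch illustrates the cross-attention mechanism that such maps are derived from: spatial UNet features act as queries, text-token embeddings act as keys, and softmax(QK^T / sqrt(d)) assigns each pixel an attention weight per token. This is a conceptual illustration only, not OVAM's implementation; all tensor shapes, names, and the random stand-in inputs are assumptions for the example.

```python
import torch

def cross_attention_maps(image_feats, token_embs):
    """Illustrative cross-attention attribution: softmax(Q K^T / sqrt(d))
    over the token axis, i.e., for every spatial location, how much
    attention mass each text token receives.

    image_feats: (H*W, d) spatial queries from a diffusion UNet layer
    token_embs:  (T, d)   text-token keys from the text encoder
    returns:     (H*W, T) attention, one column per token
    """
    d = image_feats.shape[-1]
    scores = image_feats @ token_embs.T / d ** 0.5
    return scores.softmax(dim=-1)

# Toy usage: random tensors stand in for real UNet features and text
# embeddings (shapes and the token index are illustrative).
H = W = 16
d = 64
feats = torch.randn(H * W, d)        # features from one cross-attention layer
tokens = torch.randn(5, d)           # prompt tokens plus an extra descriptor
maps = cross_attention_maps(feats, tokens)
heatmap = maps[:, -1].reshape(H, W)  # attribution map for the appended descriptor
mask = heatmap > heatmap.mean()      # crude binarization into a pseudo-label
```

In this toy setting, the appended descriptor plays the role of an open-vocabulary word evaluated against fixed image features; OVAM's token optimization, per contribution (b), additionally refines the descriptor's embedding to sharpen the resulting map, a step omitted here.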