1. Introduction
The introduction of diffusion models has led to a significant advancement in text-to-image (T2I) generation [7]. Diffusion-based models, such as Stable Diffusion [39] and other contemporary works [22], [27], [30], [32], [37], [38], [41], have been rapidly adopted across the research community and industry, owing to their ability to generate high-quality images that accurately reflect the semantics of text prompts.
This work makes three contributions. (a) We introduce Open-Vocabulary Attention Maps (OVAM), a training-free extension for text-to-image diffusion models that generates text-attribution maps from open-vocabulary descriptions. Our approach overcomes a limitation of existing methods, which are constrained to words contained in the prompt [45], [51], [55], [60]. (b) Our token optimization process improves the accuracy of the generated attention maps, thereby boosting the performance of existing semantic segmentation methods based on diffusion attention maps [19], [45], [54], [60]. (c) Finally, we validate the utility of OVAM for producing synthetic images with precise pixel-level annotations.
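To make the notion of a text-attribution map concrete, the following minimal PyTorch sketch illustrates the cross-attention mechanism that such maps are derived from: spatial UNet features act as queries, text-token embeddings act as keys, and softmax(QK^T / sqrt(d)) assigns each pixel an attention weight per token. This is a conceptual illustration only, not OVAM's implementation; all tensor shapes, names, and the random stand-in inputs are assumptions for the example.

```python
import torch

def cross_attention_maps(image_feats, token_embs):
    """Illustrative cross-attention attribution: softmax(Q K^T / sqrt(d))
    over the token axis, i.e., for every spatial location, how much
    attention mass each text token receives.

    image_feats: (H*W, d) spatial queries from a diffusion UNet layer
    token_embs:  (T, d)   text-token keys from the text encoder
    returns:     (H*W, T) attention, one column per token
    """
    d = image_feats.shape[-1]
    scores = image_feats @ token_embs.T / d ** 0.5
    return scores.softmax(dim=-1)

# Toy usage: random tensors stand in for real UNet features and text
# embeddings (shapes and the token index are illustrative).
H = W = 16
d = 64
feats = torch.randn(H * W, d)        # features from one cross-attention layer
tokens = torch.randn(5, d)           # prompt tokens plus an extra descriptor
maps = cross_attention_maps(feats, tokens)
heatmap = maps[:, -1].reshape(H, W)  # attribution map for the appended descriptor
mask = heatmap > heatmap.mean()      # crude binarization into a pseudo-label
```

In this toy setting, the appended descriptor plays the role of an open-vocabulary word evaluated against fixed image features; OVAM's token optimization, per contribution (b), additionally refines the descriptor's embedding to sharpen the resulting map, a step omitted here.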