Abstract:
The diffusion transformer (DiT) architecture has attracted much attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods perform global-aware synthesis, and regional prompt control remains less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first leverage a powerful large language model (LLM) to generate a high-level description of the image (such as content, topic, and objects) and a low-level description (such as details and style). Then we explore the influence of cross-attention layers at different depths. We discover that deeper layers are responsible for high-level content control, while shallower layers handle low-level content control. The different prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. Using the proposed pipeline, we improve the controllability of DiT-based image generation. Extensive quantitative and qualitative results demonstrate that our pipeline improves generation performance. Our codes are available at https://github.com/ZhenXiong-dl/ICASSP2025-RCAC.
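The depth-dependent prompt injection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the layer count, the `route_prompt` function, and the half-depth split between shallow and deep layers are all assumptions for clarity.

```python
# Hypothetical sketch of depth-dependent prompt routing in a DiT.
# Shallow cross-attention layers attend to the low-level prompt
# (details, style); deep layers attend to the high-level prompt
# (content, topic, objects), as the abstract describes.

NUM_LAYERS = 28  # assumed transformer depth, for illustration only

def route_prompt(layer_idx: int, num_layers: int = NUM_LAYERS) -> str:
    """Return which prompt a cross-attention layer at this depth uses."""
    if layer_idx < num_layers // 2:
        return "low_level"   # shallow half: details / style
    return "high_level"      # deep half: content / topic / objects

# One routing decision per cross-attention layer:
routing = [route_prompt(i) for i in range(NUM_LAYERS)]
```

In a real pipeline, each entry of `routing` would select which LLM-generated prompt embedding is fed to that layer's regional cross-attention; the split point here is an arbitrary choice, not the one reported by the authors.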
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025