1. Introduction
Recently, diffusion models have achieved encouraging progress in conditional image generation, especially in text-to-image generation with models such as GLIDE [24], Imagen [31], and Stable Diffusion [30]. However, text-guided diffusion models may still fail in the following situations. As shown in Fig. 1(a), when the goal is to generate a complex image with multiple objects, it is hard to design a prompt that is both accurate and comprehensive. Even with well-designed prompts, state-of-the-art text-guided diffusion models [24], [30], [31] still suffer from problems such as missing objects and incorrectly generated object positions, shapes, and categories.