LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation


Abstract:

Recently, diffusion models have achieved great success in image synthesis. However, in layout-to-image generation, where an image often contains a complex scene with multiple objects, exerting strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that obtains higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout that is fused with the normal layout in a unified form. Moreover, a Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects; both are designed to be object-aware and position-sensitive, allowing precise control over spatially related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by a relative 46.35% and 26.70% on COCO-Stuff, and by 44.29% and 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.
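To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of an object-aware cross attention in the spirit described above: image patch features are paired with their region coordinates (the "special layout"), layout objects are embedded from their category and bounding box, and the two are fused by cross attention. It is an illustrative assumption, not the authors' implementation; all module names, dimensions, and parameters are hypothetical.

```python
import torch
import torch.nn as nn


class ObjectAwareCrossAttention(nn.Module):
    """Sketch of an object-aware, position-sensitive cross attention.

    Queries come from image patches augmented with their region info;
    keys/values come from layout objects (category + bounding box).
    """

    def __init__(self, img_dim=256, num_classes=184, embed_dim=256, num_heads=8):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, embed_dim)  # object category embedding
        self.bbox_embed = nn.Linear(4, embed_dim)                 # (x0, y0, x1, y1) box embedding
        self.patch_region_embed = nn.Linear(4, embed_dim)         # region info of each image patch
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(embed_dim, img_dim)

    def forward(self, patch_feats, patch_regions, obj_classes, obj_boxes):
        # patch_feats:   (B, P, img_dim)  features of P image patches from the diffusion backbone
        # patch_regions: (B, P, 4)        normalized region of each patch ("special layout")
        # obj_classes:   (B, O)           category ids of the layout objects
        # obj_boxes:     (B, O, 4)        normalized bounding boxes of the layout objects
        q = self.img_proj(patch_feats) + self.patch_region_embed(patch_regions)
        kv = self.class_embed(obj_classes) + self.bbox_embed(obj_boxes)
        # Both queries and keys carry position and object information,
        # so the attention is object-aware and position-sensitive.
        fused, _ = self.attn(q, kv, kv)
        # Residual fusion back into the image feature stream.
        return patch_feats + self.out_proj(fused)


# Hypothetical usage: 64 patches, 8 layout objects, batch size 2.
if __name__ == "__main__":
    oaca = ObjectAwareCrossAttention()
    patches = torch.randn(2, 64, 256)
    regions = torch.rand(2, 64, 4)
    classes = torch.randint(0, 184, (2, 8))
    boxes = torch.rand(2, 8, 4)
    out = oaca(patches, regions, classes, boxes)
    print(out.shape)  # torch.Size([2, 64, 256])
```

In this sketch, treating patch regions and object boxes in the same coordinate space is what lets the patched image act as a "layout" that can be fused with the normal layout in a unified form.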
Date of Conference: 17-24 June 2023
Date Added to IEEE Xplore: 22 August 2023
Conference Location: Vancouver, BC, Canada


1. Introduction

Recently, diffusion models have achieved encouraging progress in conditional image generation, especially in text-to-image generation such as GLIDE [24], Imagen [31], and Stable Diffusion [30]. However, text-guided diffusion models may still fail in the following situations. As shown in Fig. 1(a), when aiming to generate a complex image with multiple objects, it is hard to design a prompt properly and comprehensively. Even with well-designed prompts as input, problems such as missing objects and incorrectly generated object positions, shapes, and categories still occur in state-of-the-art text-guided diffusion models [24], [30], [31].
