Abstract:
Recent advances in text-to-image synthesis have captivated audiences worldwide and drawn considerable attention. Although significant progress has been made in generating photo-realistic images with large pre-trained autoregressive and diffusion models, these models face three critical constraints: (1) they require extensive training data and numerous model parameters; (2) their multi-step image generation process is inefficient; and (3) their output visual features are difficult to control, requiring carefully designed prompts to ensure text-image alignment. To address these challenges, we introduce the CLIP-GAN model, which integrates the pretrained CLIP model into both the generator and the discriminator of a GAN. Our architecture includes a CLIP-based generator that exploits visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator that utilizes CLIP's advanced scene-understanding capabilities for more precise image quality evaluation. In addition, the generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM), enabling better fusion of text and image semantic information. This integration within the generator and discriminator improves training efficiency, allowing our model to achieve evaluation results on par with large pre-trained autoregressive and diffusion models while using 94% fewer learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation under a limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation and the potential for greater stylistic diversity within the GAN framework while preserving its smooth latent space.
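For readers unfamiliar with this kind of design, the sketch below illustrates the general idea of conditioning a GAN on frozen CLIP embeddings: a generator block modulated by a CLIP text embedding and a discriminator head that scores CLIP image features against that embedding. This is not the authors' implementation; the internals of the feature adapter, TG-Block, and PFFM are not given in the abstract, so all module names, dimensions, and the FiLM-style fusion choice here are assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation): conditioning a GAN generator
# block on a frozen CLIP text embedding and scoring images with a CLIP-conditioned
# discriminator head. Random tensors stand in for CLIP outputs.
import torch
import torch.nn as nn


class TextConditionedBlock(nn.Module):
    """Hypothetical generator block: modulates image features with a text embedding."""

    def __init__(self, channels: int, text_dim: int = 512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Project the CLIP text embedding to per-channel scale and shift (FiLM-style).
        self.to_scale_shift = nn.Linear(text_dim, channels * 2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        h = self.act(self.conv(x))
        return h * (1 + scale) + shift


class CLIPConditionedDiscriminatorHead(nn.Module):
    """Hypothetical head: combines frozen CLIP image features with the text
    embedding to produce a real/fake score."""

    def __init__(self, image_feat_dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(image_feat_dim + text_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, clip_image_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        return self.score(torch.cat([clip_image_feat, text_emb], dim=1))


if __name__ == "__main__":
    # Placeholders for frozen CLIP embeddings (e.g. 512-d ViT-B/32 outputs).
    text_emb = torch.randn(4, 512)
    feats = torch.randn(4, 64, 32, 32)
    clip_image_feat = torch.randn(4, 512)

    gen_block = TextConditionedBlock(channels=64)
    disc_head = CLIPConditionedDiscriminatorHead()

    print(gen_block(feats, text_emb).shape)             # torch.Size([4, 64, 32, 32])
    print(disc_head(clip_image_feat, text_emb).shape)   # torch.Size([4, 1])
```

Because the CLIP backbone stays frozen in such a setup, only the small adapter and fusion modules are trained, which is consistent with the large reduction in learnable parameters the abstract reports.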
Published in: IEEE Transactions on Multimedia (Early Access)