About Imagen by Google

Imagen is a text-to-image diffusion model developed by the Brain Team at Google Research. It combines large transformer language models, which interpret the text prompt, with diffusion models, which generate high-fidelity images. Given a text prompt, the system produces photorealistic images, and its authors report an unprecedented degree of photorealism together with a deep level of language understanding.


  1. Photorealistic Image Generation: Imagen uses a large frozen T5-XXL encoder to encode the input text into embeddings. A conditional diffusion model then maps the text embedding into a 64×64 image. Text-conditional super-resolution diffusion models then upsample that image to higher resolutions (first 256×256, then 1024×1024 in the published system).
  2. Deep Textual Understanding: Imagen demonstrates that large, frozen, pretrained text encoders are highly effective for the text-to-image task. In the authors' experiments, scaling the size of the pretrained text encoder improves results more than scaling the size of the diffusion model.
  3. State-of-the-Art Performance: Imagen achieves a new state-of-the-art Fréchet Inception Distance (FID) score of 7.27 on the COCO dataset (lower is better), without ever training on COCO. Human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. In side-by-side comparisons against other text-to-image models, human raters prefer Imagen in terms of both sample quality and image-text alignment.
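
The generation cascade described in item 1 can be sketched as plain data flow. This is a toy illustration only, not Imagen's code or API: the function names (encode_text, base_model, super_resolve, generate) are hypothetical, the "models" are stand-ins that produce random or nearest-neighbour-upsampled pixels, and the 256×256 and 1024×1024 stage sizes are taken from the published system rather than from the text above. What it shows is the structure: the text is encoded once by a frozen encoder, and the same embedding conditions the base stage and every super-resolution stage.

```python
import random
from typing import List

Image = List[List[float]]  # toy single-channel image: rows of pixel values


def encode_text(prompt: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for the frozen text encoder (T5-XXL in Imagen).

    Returns a deterministic toy embedding derived from the prompt.
    """
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def base_model(text_emb: List[float], size: int = 64) -> Image:
    """Hypothetical stand-in for the text-conditional base diffusion model.

    A real model would iteratively denoise; here we only emit a
    size x size grid of pseudo-random pixels seeded by the embedding.
    """
    rng = random.Random(sum(text_emb))
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def super_resolve(image: Image, text_emb: List[float], factor: int = 4) -> Image:
    """Hypothetical stand-in for a text-conditional super-resolution model.

    Uses nearest-neighbour upsampling. A real model would synthesize new
    detail conditioned on text_emb, which this toy version ignores.
    """
    out: Image = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def generate(prompt: str) -> Image:
    emb = encode_text(prompt)       # encoded once, reused by every stage
    img = base_model(emb)           # 64 x 64
    img = super_resolve(img, emb)   # 256 x 256
    img = super_resolve(img, emb)   # 1024 x 1024
    return img


if __name__ == "__main__":
    image = generate("a corgi riding a bicycle")
    print(len(image), len(image[0]))  # prints: 1024 1024
```

Keeping the encoder outside the stages mirrors the "frozen encoder" design: the text representation is computed once and is never updated by the image models.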