Image generation has evolved rapidly over the past decade. GANs (Generative Adversarial Networks, 2014) were the first major breakthrough, with StyleGAN showing impressive face synthesis. VAEs (Variational Autoencoders) offered another early approach but struggled with image quality. The field shifted dramatically around 2020-2021 with diffusion models, which became dominant after DALL-E 2, Stable Diffusion, and Midjourney demonstrated superior quality and controllability. Transformers also entered the space with models like DALL-E and later improvements.
Main Types of Image/Video Generation Models:
- Diffusion Models
- Denoising Diffusion Probabilistic Models (DDPM)
- Latent Diffusion Models (LDMs) - like Stable Diffusion
- Variants: DALL-E 2, Imagen, Midjourney
- GANs (Generative Adversarial Networks)
- StyleGAN series
- BigGAN
- Progressive GAN
- Autoregressive Models
- DALL-E (original)
- Parti
- Generate images token-by-token
- VAE-based
- Vector Quantized VAEs (VQ-VAE)
- Often used as components in other models
- Flow-based Models
- Normalizing flows
- Less common now but theoretically elegant
- Video Generation
- Diffusion-based: Stable Video Diffusion, Pika, Runway Gen-2
- Autoregressive: Phenaki
- Hybrid approaches: Sora (diffusion transformer)
huggingface Diffusers
- huggingface Diffusers Docs
- https://huggingface.co/Tongyi-MAI/Z-Image
- https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
- https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
- https://huggingface.co/stabilityai/sd-turbo
- https://huggingface.co/stabilityai/sdxl-turbo