Generative AI
Fundamentals

02 / 06 · 2026-06-30 · 18 min read · GANs Diffusion VAE Intermediate

In Post 01 you learned how a model learns from data. Now we go further: how do models create new data — images, text, audio — that never existed? This post breaks down the three families of generative AI: VAEs, GANs, and Diffusion Models, with diagrams and working code for each.

In this post

1. What Is Generative AI?
2. The Latent Space Idea
3. Variational Autoencoders (VAEs)
4. Generative Adversarial Networks (GANs)
5. Diffusion Models
6. Which One Should You Use?
7. Knowledge Check
8. What's Next

1. What Is Generative AI?

A discriminative model learns to classify: given an image, is it a cat or a dog? A generative model learns the underlying distribution of data — and can sample from it to produce new examples that look like they came from the training set.

The core question generative models answer: "What does data from this distribution look like?" Once a model understands that, it can generate unlimited new examples.

💡

Analogy: A discriminative model is a critic who can tell real art from fake. A generative model is an artist who has studied so much real art that they can paint convincingly from scratch.

Discriminative

Input → Label

Example: "Is this email spam?"

Generative

Noise / Condition → Data

Example: "Generate a photo of a cat."
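
To make "learn the distribution, then sample from it" concrete, here is a deliberately tiny sketch using only the Python standard library. It assumes the data is one-dimensional and roughly Gaussian, which real images are not. The names `fit_gaussian` and `generate` and the numbers are made up for illustration.

```python
import random
import statistics

def fit_gaussian(data):
    """'Training': estimate the parameters of the data distribution."""
    return statistics.mean(data), statistics.stdev(data)

def generate(mu, sigma, n, seed=None):
    """'Generation': draw new samples from the learned distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Toy training set: heights in cm (made-up numbers for illustration)
heights = [158.0, 162.5, 170.1, 175.3, 168.4, 181.2, 165.0, 172.8]
mu, sigma = fit_gaussian(heights)
new_heights = generate(mu, sigma, n=5, seed=0)  # five freshly sampled values
```

VAEs, GANs, and diffusion models do the same two things, fit and sample, but for distributions far too complex to describe with a mean and a standard deviation.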

2. The Latent Space Idea

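All three generative architectures build on a shared idea: describe data through a simpler space, called a latent space, and learn to turn points in that space into realistic data. VAEs learn the compression explicitly with an encoder; GANs and diffusion models start from random noise and learn only the direction from noise to data.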

Think of it this way: a 256×256 RGB face image has 256×256×3 = 196,608 values, roughly 200,000. But the meaningful variation (age, expression, lighting) might be captured in just 128 numbers. That compressed representation is the latent vector.

High-dimensional data (image: 256×256×3) ⟶ Encoder ⟶ Latent vector (z: 128 dims) ⟶ Decoder ⟶ Generated data (new image)

⚡

Key property: nearby points in latent space produce similar outputs. This is what lets you interpolate between two faces, or smoothly morph one image into another.
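
The interpolation idea can be sketched in a few lines. This is a minimal example, assuming two latent vectors are already available; here they are random stand-ins, and `interpolate_latents` is an illustrative helper, not a library function.

```python
import torch

def interpolate_latents(z_a, z_b, steps=5):
    """Return `steps` latent vectors evenly spaced on the line from z_a to z_b."""
    w = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # shape (steps, 1)
    return (1.0 - w) * z_a + w * z_b                   # shape (steps, latent_dim)

z_a = torch.randn(128)   # stand-in for the latent vector of face A
z_b = torch.randn(128)   # stand-in for the latent vector of face B
path = interpolate_latents(z_a, z_b, steps=5)
# Decoding each row of `path` with a trained decoder would give the morph sequence.
```

Straight-line interpolation is the simplest choice; whether the intermediate points decode to convincing images depends on how well organized the learned latent space is.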

3. Variational Autoencoders (VAEs)

A VAE is arguably the simplest of the three architectures. It has two parts: an encoder that maps data to a latent distribution, and a decoder that maps a sampled latent vector back to data.

The key trick: instead of encoding to a fixed point, the encoder outputs the mean (μ) and variance (σ²) of a Gaussian. In code the network usually predicts the log-variance, log σ², which keeps σ² positive without extra constraints. During training we sample from that Gaussian. Together with the KL term in the loss, this pushes the latent space to be continuous and well organized, so a point sampled from the prior usually decodes to a plausible output.

vae.py
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: data → mu and log_var
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu    = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder: z → reconstructed data
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Sample z = mu + eps * sigma  (keeps gradients flowing)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h    = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z    = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
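
Once a VAE is trained, generating new data needs only the decoder. The sketch below assumes an untrained stand-in with the same layer sizes as `VAE.decoder` above, so its outputs are noise rather than digits; the `generate` helper and the 28×28 reshape (an MNIST-sized guess for `input_dim=784`) are assumptions for illustration.

```python
import torch
import torch.nn as nn

latent_dim, input_dim = 32, 784

# Stand-in with the same layout as VAE.decoder; in practice use a trained model's decoder.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, input_dim), nn.Sigmoid(),
)

@torch.no_grad()
def generate(decoder, n_samples, latent_dim):
    z = torch.randn(n_samples, latent_dim)  # sample from the prior N(0, I)
    return decoder(z)                        # each row is one generated example

samples = generate(decoder, n_samples=4, latent_dim=latent_dim)
images = samples.view(-1, 28, 28)            # assumes 784 = 28 x 28 grayscale images
```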

The VAE Loss

VAE training minimizes two terms simultaneously:

Reconstruction Loss

How well does the decoder reproduce the original input? Uses binary cross-entropy or MSE.

KL Divergence

How close is the learned latent distribution to a standard Gaussian? Regularizes the latent space.
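
The two terms can be sketched as a single loss function. This is a common formulation rather than the only one: it assumes inputs scaled to [0, 1] (so binary cross-entropy applies), a standard Gaussian prior, and equal weighting of the two terms. The name `vae_loss` is illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: binary cross-entropy, summed over pixels and batch.
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # KL(N(mu, sigma^2) || N(0, I)) in closed form, summed over latent dims and batch.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, recon, kl

# Usage with random stand-in tensors (batch of 8, 784 pixels, 32 latent dims)
x = torch.rand(8, 784)
recon_x = torch.rand(8, 784).clamp(1e-6, 1 - 1e-6)
mu, logvar = torch.zeros(8, 32), torch.zeros(8, 32)
total, recon, kl = vae_loss(recon_x, x, mu, logvar)
```

With `mu` and `logvar` both zero, the encoder's distribution already equals the prior, so the KL term vanishes and only the reconstruction term remains.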

⚠️

VAE weakness: generated images tend to be blurry. Because the model averages over many possible reconstructions, fine details get smeared out. GANs and Diffusion models produce sharper results.

4. Generative Adversarial Networks (GANs)

Ian Goodfellow's 2014 insight: train two networks against each other. The Generator creates fake data. The Discriminator tries to distinguish real from fake. Each improves by competing with the other.

Random noise z (Gaussian vector) ⟶ Generator G (upsamples to image) ⟶ Fake image ⟶ Discriminator D (real or fake?)

Real images also feed into D directly from the training set.

Training Dynamics

The Discriminator is trained to output 1 for real images and 0 for fakes. The Generator is trained to fool D — make it output 1 for fake images. At equilibrium, G produces images indistinguishable from real ones and D guesses 50/50 (the theoretical ideal — rarely achieved cleanly in practice).
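
One round of this game can be sketched as a single training step. The example assumes D returns raw logits (so `BCEWithLogitsLoss` applies the sigmoid), uses tiny MLPs on 2-D toy data instead of images, and picks Adam hyperparameters that are common in GAN code but not prescribed by this post. `gan_train_step` is an illustrative name.

```python
import torch
import torch.nn as nn

def gan_train_step(G, D, real, opt_g, opt_d, latent_dim):
    bce = nn.BCEWithLogitsLoss()          # assumes D returns raw logits, not probabilities
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(batch, latent_dim)).detach()   # detach: do not update G here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) Generator step: push D(G(z)) toward 1, i.e. try to fool D.
    loss_g = bce(D(G(torch.randn(batch, latent_dim))), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Toy setup: 2-D "data" and tiny MLPs, just to exercise the step once.
latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
real = torch.randn(32, 2) + 3.0
loss_d, loss_g = gan_train_step(G, D, real, opt_g, opt_d, latent_dim)
```

In a real training run this step is repeated over many batches, and the two losses are watched together: if one network wins decisively, the other stops receiving a useful learning signal.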

⚠️

Mode collapse: a classic GAN failure mode. G learns to produce a narrow set of outputs that reliably fool D, instead of covering the full variety of the training data. Tricks like minibatch discrimination and Wasserstein loss help.
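
As one illustration of the Wasserstein idea, the sketch below follows the original WGAN recipe as I understand it: the discriminator becomes a "critic" that outputs an unbounded score, and its weights are clipped after each update. The function names, the toy critic, and the use of plain tensors in place of generator output are assumptions for illustration; later variants replace clipping with a gradient penalty.

```python
import torch
import torch.nn as nn

def wgan_losses(critic, real, fake):
    # The critic outputs an unbounded score (no sigmoid): higher means "more real".
    loss_critic = critic(fake.detach()).mean() - critic(real).mean()
    loss_gen = -critic(fake).mean()
    return loss_critic, loss_gen

def clip_weights(critic, c=0.01):
    # Crude way to keep the critic approximately Lipschitz: clamp every weight to [-c, c].
    for p in critic.parameters():
        p.data.clamp_(-c, c)

critic = nn.Sequential(nn.Linear(2, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
real, fake = torch.randn(32, 2) + 3.0, torch.randn(32, 2)   # stand-ins for real and generated batches
loss_critic, loss_gen = wgan_losses(critic, real, fake)
clip_weights(critic)   # would be called after each critic optimizer step
```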

Notable GAN variants

DCGAN — added convolutions, stable training

StyleGAN2 — photorealistic face generation, style mixing

Pix2Pix — image-to-image translation (sketch → photo)

CycleGAN — unpaired image translation (horse ↔ zebra)

5. Diffusion Models

Diffusion models (popularized from 2020 onward) are now the dominant architecture for image generation, powering systems such as Stable Diffusion and DALL-E 3, and reportedly Midjourney. The idea: learn to reverse a noise-addition process.

Forward Process (Noising)

Gradually add Gaussian noise to an image over T steps until it becomes pure noise. This is fixed — no learning required.

🖼️ Clean image (t = 0) → 🌫️ Noisy image (t = T/2) → ❄️ Pure noise (t = T)
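
Because the forward process is fixed, the noisy image at any step t can be computed directly from the clean image without looping over the earlier steps. The sketch below assumes the DDPM-style linear schedule (betas from 1e-4 to 0.02 over 1000 steps); `make_schedule` and `q_sample` are illustrative names, and the input is a random stand-in for images scaled to [-1, 1].

```python
import torch

def make_schedule(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, timesteps)   # noise variance added per step
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)            # cumulative product of (1 - beta)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, noise=None):
    """Jump straight from a clean batch x0 to its noised version at step t (one t per sample)."""
    if noise is None:
        noise = torch.randn_like(x0)
    signal = alpha_bars[t].sqrt().view(-1, 1, 1, 1)           # how much of the image survives
    sigma = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)    # how much noise is mixed in
    return signal * x0 + sigma * noise

betas, alpha_bars = make_schedule()
x0 = torch.rand(2, 3, 64, 64) * 2 - 1                         # stand-in images in [-1, 1]
x_mid = q_sample(x0, torch.tensor([499, 499]), alpha_bars)    # partly noised
x_end = q_sample(x0, torch.tensor([999, 999]), alpha_bars)    # almost pure noise
```

At the last step the surviving signal fraction is tiny, which is why sampling can later start from pure Gaussian noise.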

Reverse Process (Denoising — what we train)

A neural network (typically a U-Net) learns to predict the noise added at each step. At inference, start from pure noise and iteratively denoise to get a clean image.
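
The training objective behind "predict the noise" can be sketched as follows: noise a clean batch to a random step, ask the model which noise was added, and penalize the squared error. This assumes the same linear schedule as DDPM; `diffusion_loss` is an illustrative name and `TinyDenoiser` is a placeholder that ignores the timestep, standing in for a real U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diffusion_loss(model, x0, alpha_bars):
    batch = x0.size(0)
    t = torch.randint(0, alpha_bars.numel(), (batch,))        # random step per sample
    noise = torch.randn_like(x0)                               # this is the regression target
    signal = alpha_bars[t].sqrt().view(-1, 1, 1, 1)
    sigma = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    x_t = signal * x0 + sigma * noise                          # noised input at step t
    return F.mse_loss(model(x_t, t), noise)

class TinyDenoiser(nn.Module):
    """Placeholder for a U-Net: one conv layer that ignores t."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.conv(x)

alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
model = TinyDenoiser()
loss = diffusion_loss(model, torch.rand(4, 3, 16, 16), alpha_bars)
loss.backward()   # gradients flow into the denoiser as in any regression task
```

Note that there is no second network and no adversarial term: training is ordinary supervised regression on noise.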

ddpm_inference.py
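# Simplified DDPM sampling loop (model is assumed to predict the added noise)
import torch

@torch.no_grad()
def sample(model, timesteps=1000, shape=(1, 3, 64, 64)):
    # Linear noise schedule; these values are an assumption (common DDPM defaults)
    betas = torch.linspace(1e-4, 0.02, timesteps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)          # start from pure noise

    for t in reversed(range(timesteps)):
        t_tensor = torch.full((shape[0],), t, dtype=torch.long)
        predicted_noise = model(x, t_tensor)
        # Remove the predicted noise (mean of the reverse step) ...
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * predicted_noise) / torch.sqrt(alphas[t])
        # ... then add a little back for stochasticity, except at the final step
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise

    return x   # final clean image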

Why Diffusion Won

No mode collapse

No adversarial game — training is stable and diverse outputs emerge naturally.

Conditioning is easy

Feed a text embedding at each step to guide generation — this is how text-to-image works.
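
To show where the condition enters, here is a toy denoiser that takes an extra embedding alongside the noisy input and the timestep. It is a sketch under strong simplifications: the data is flattened, the network is a small MLP, and the "text embedding" is a random vector. Production text-to-image models typically use a U-Net or transformer with cross-attention instead; `ConditionalDenoiser` and all of its sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, data_dim=64, cond_dim=16, hidden=128, timesteps=1000):
        super().__init__()
        self.time_emb = nn.Embedding(timesteps, hidden)   # learned embedding per timestep
        self.cond_proj = nn.Linear(cond_dim, hidden)      # projects the text embedding
        self.net = nn.Sequential(
            nn.Linear(data_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, x, t, cond):
        context = self.time_emb(t) + self.cond_proj(cond)  # fuse "which step" and "what to draw"
        return self.net(torch.cat([x, context], dim=1))    # predicted noise, same shape as x

model = ConditionalDenoiser()
x = torch.randn(4, 64)                 # flattened noisy samples
t = torch.randint(0, 1000, (4,))       # one timestep per sample
cond = torch.randn(4, 16)              # stand-in for embeddings from a text encoder
predicted_noise = model(x, t, cond)
```

The sampling loop stays the same; the only change is that every call to the model also receives the condition.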

6. Which One Should You Use?

Architecture	Sample quality	Sampling speed	Training stability	Best for
VAE	Medium	Fast	Stable	Latent space research, compression
GAN	High	Fast	Tricky	Image synthesis, style transfer
Diffusion	State of the art	Slow	Stable	Text-to-image, audio, video

For production image generation today, Diffusion models are the default. GANs still win on speed (single forward pass vs. 20–1000 diffusion steps). VAEs are the go-to when you need a compact, searchable latent space.

7. Knowledge Check

5 questions

Q1. What does the encoder in a VAE output?

Q2. In a GAN, what does the Discriminator try to do?

Q3. What is "mode collapse" in GAN training?

Q4. In diffusion models, what does the neural network actually learn to predict?

Q5. Which architecture is generally best for text-to-image generation today?

8. What's Next

You now understand all three generative AI families. In the next post we go practical: