
Generative AI Fundamentals

In Post 01 you learned how a model learns from data. Now we go further: how do models create new data — images, text, audio — that never existed? This post breaks down the three families of generative AI: VAEs, GANs, and Diffusion Models, with diagrams and working code for each.

1. What Is Generative AI?

A discriminative model learns to classify: given an image, is it a cat or a dog? A generative model learns the underlying distribution of data — and can sample from it to produce new examples that look like they came from the training set.

The core question generative models answer: "What does data from this distribution look like?" Once a model understands that, it can generate unlimited new examples.

💡 Analogy: A discriminative model is a critic who can tell real art from fake. A generative model is an artist who has studied so much real art that they can paint convincingly from scratch.

Discriminative: Input → Label. Example: "Is this email spam?"
Generative: Noise / Condition → Data. Example: "Generate a photo of a cat."
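As a toy illustration of the generative direction, the sketch below "learns" a 1-D distribution by estimating its mean and standard deviation, then samples new points from it. The Gaussian assumption, the synthetic data, and the names `fit_gaussian` and `sample_gaussian` are illustrative choices, not part of the original post; real generative models learn far richer distributions.

```python
import torch


def fit_gaussian(data):
    """'Learn' a 1-D distribution by estimating its mean and standard deviation."""
    return data.mean(), data.std()


def sample_gaussian(mean, std, n):
    """Draw n new points from the fitted distribution."""
    return mean + std * torch.randn(n)


torch.manual_seed(0)
train = 5.0 + 2.0 * torch.randn(10_000)     # synthetic training set (assumption)
mean, std = fit_gaussian(train)
new_points = sample_gaussian(mean, std, 5)  # fresh samples, not copies of training points
print(new_points)
```

The same two steps, estimate the distribution and then sample from it, are what VAEs, GANs, and diffusion models do at much larger scale.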

2. The Latent Space Idea

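All three generative architectures share a core idea: a latent space, a simpler representation from which a decoder learns to produce realistic data. VAEs learn an explicit encoder and decoder around a compressed latent vector. GANs skip the encoder and decode random latent vectors directly. Diffusion models start from noise with the same shape as the data, although latent diffusion systems such as Stable Diffusion run the process inside a compressed latent space.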

Think of it this way: a face image has 256×256×3 = 196,608 pixel values (roughly 200,000). But the meaningful variation (age, expression, lighting) might be captured in just 128 numbers. That compressed representation is the latent vector.

High-dimensional data (image: 256×256×3) → Encoder → Latent vector (z: 128 dims) → Decoder → Generated data (new image)

Key property: nearby points in latent space produce similar outputs. This is what lets you interpolate between two faces, or smoothly morph one image into another.
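A minimal sketch of that interpolation, assuming a 128-dimensional latent space as in the example above. The `interpolate_latents` helper and the untrained stand-in `decoder` are hypothetical; with a trained decoder, the decoded frames would morph smoothly from one output to the other.

```python
import torch
import torch.nn as nn


def interpolate_latents(z_start, z_end, steps=8):
    """Return `steps` latent vectors evenly spaced on the line from z_start to z_end."""
    weights = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # shape (steps, 1)
    return (1.0 - weights) * z_start + weights * z_end      # shape (steps, latent_dim)


# Stand-in decoder (untrained); a real one would come from a trained model.
decoder = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

z_a, z_b = torch.randn(128), torch.randn(128)
path = interpolate_latents(z_a, z_b, steps=8)
with torch.no_grad():
    frames = decoder(path)  # one decoded output per point on the path
print(frames.shape)  # torch.Size([8, 784])
```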

3. Variational Autoencoders (VAEs)

A VAE is the simplest of the three architectures covered here. It has two parts: an encoder that maps data to a latent distribution, and a decoder that maps a sampled latent vector back to data.

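The key trick: instead of encoding to a fixed point, the encoder outputs the mean (μ) and variance (σ²) of a Gaussian; in code the variance is usually predicted as log σ² (logvar) for numerical stability. During training we sample z from that Gaussian. Together with the KL term in the loss, this pushes the latent space to be continuous and well organized, so you can sample any point and get a plausible output.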

vae.py
```python
import torch
import torch.nn as nn


class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: data → mu and log_var
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder: z → reconstructed data
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Sample z = mu + eps * sigma (keeps gradients flowing)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```

The VAE Loss

VAE training minimizes two terms simultaneously:

Reconstruction loss: how well does the decoder reproduce the original input? Typically binary cross-entropy or MSE.
KL divergence: how close is the learned latent distribution to a standard Gaussian? This term regularizes the latent space.
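The two terms can be sketched as a single loss function. This is a minimal version, assuming inputs scaled to [0, 1] (so binary cross-entropy applies), a log-variance output from the encoder, and equal weighting of the two terms; the name `vae_loss` and the choice to sum rather than average are illustrative.

```python
import torch
import torch.nn.functional as F


def vae_loss(recon_x, x, mu, logvar):
    """Reconstruction loss plus KL divergence to a standard Gaussian, summed over the batch."""
    # Reconstruction term: binary cross-entropy, which assumes inputs in [0, 1]
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form, with logvar = log(sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, recon, kl


# Placeholder tensors standing in for a batch and the model's outputs
x = torch.rand(4, 784)
recon_x = torch.rand(4, 784).clamp(1e-6, 1 - 1e-6)
mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)
total, recon, kl = vae_loss(recon_x, x, mu, logvar)
# KL is zero here because mu = 0 and logvar = 0 describe exactly N(0, 1)
```

Scaling the KL term up or down trades reconstruction sharpness against how regular the latent space is.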
⚠️ VAE weakness: generated images tend to be blurry. Because the model averages over many possible reconstructions, fine details get smeared out. GANs and diffusion models typically produce sharper results.

4. Generative Adversarial Networks (GANs)

Ian Goodfellow's 2014 insight: train two networks against each other. The Generator creates fake data. The Discriminator tries to distinguish real from fake. Each improves by competing with the other.

Random noise z (Gaussian vector) → Generator G (upsamples to image) → Fake image → Discriminator D (real or fake?)
Real images from the training set also feed directly into D.

Training Dynamics

The Discriminator is trained to output 1 for real images and 0 for fakes. The Generator is trained to fool D — make it output 1 for fake images. At equilibrium, G produces images indistinguishable from real ones and D guesses 50/50 (the theoretical ideal — rarely achieved cleanly in practice).
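One round of this game can be sketched as below. It is a minimal version using small fully connected networks on placeholder data, not a production setup: the layer sizes, the Adam settings, and the `train_step` name are assumptions, and D outputs a logit so that `BCEWithLogitsLoss` can apply the sigmoid internally.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))  # logit output
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()


def train_step(real):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator: push D(real) toward 1 and D(fake) toward 0
    fake = G(torch.randn(batch, latent_dim)).detach()  # detach: do not update G here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) Generator: push D(G(z)) toward 1, i.e. try to fool D
    fake = G(torch.randn(batch, latent_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()


real_batch = torch.rand(32, data_dim) * 2 - 1  # placeholder "real" data in [-1, 1]
loss_d, loss_g = train_step(real_batch)
```

In a real training loop, `train_step` would be called once per batch, and the two losses would be watched together, since either network overpowering the other tends to stall training.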

⚠️ Mode collapse: one of the most common GAN failures. G learns to produce a narrow set of outputs that reliably fool D, instead of diverse samples. Techniques such as minibatch discrimination and Wasserstein loss help.
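For reference, the Wasserstein loss mentioned above replaces the real/fake classification with unbounded scores from a "critic". The sketch below shows only the two loss expressions on placeholder scores; the function names are illustrative, and a working WGAN also needs a constraint on the critic (weight clipping or a gradient penalty), which is not shown here.

```python
import torch


def critic_loss(critic_real, critic_fake):
    """Wasserstein critic loss: raise scores on real data, lower them on fakes."""
    return critic_fake.mean() - critic_real.mean()


def generator_loss(critic_fake):
    """Wasserstein generator loss: raise the critic's score on generated data."""
    return -critic_fake.mean()


# Unbounded critic scores (no sigmoid), here as placeholder tensors
real_scores = torch.tensor([2.0, 1.0, 3.0])
fake_scores = torch.tensor([-1.0, 0.0, -2.0])
print(critic_loss(real_scores, fake_scores).item())  # -3.0
print(generator_loss(fake_scores).item())            # 1.0
```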

Notable GAN variants

DCGAN — added convolutions, stable training
StyleGAN2 — photorealistic face generation, style mixing
Pix2Pix — image-to-image translation (sketch → photo)
CycleGAN — unpaired image translation (horse ↔ zebra)

5. Diffusion Models

Diffusion models (2020+) are now the dominant architecture for image generation, powering Stable Diffusion, DALL-E 3, and Midjourney. The idea: learn to reverse a noise-addition process.

Forward Process (Noising)

Gradually add Gaussian noise to an image over T steps until it becomes pure noise. This is fixed — no learning required.

Clean image (t = 0) → Noisy image (t = T/2) → Pure noise (t = T)
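The forward process has a convenient closed form, so a noisy version at any step can be produced in one shot instead of looping. The sketch below assumes the linear noise schedule commonly used with DDPM (from 1e-4 to 0.02 over 1000 steps) and 0-based step indices; `add_noise` is a hypothetical helper name.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # share of signal variance left at each step


def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise


x0 = torch.rand(1, 3, 64, 64) * 2 - 1  # placeholder image scaled to [-1, 1]
x_half, _ = add_noise(x0, T // 2)      # partly noised
x_last, _ = add_noise(x0, T - 1)       # almost no signal left
```

Because `alpha_bars` shrinks toward zero, the last step is nearly indistinguishable from a pure Gaussian sample, which is why sampling can start from `torch.randn`.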

Reverse Process (Denoising — what we train)

A neural network (typically a U-Net) learns to predict the noise added at each step. At inference, start from pure noise and iteratively denoise to get a clean image.
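The training side can be sketched as a single loss computation: pick a random step, noise the clean sample, and ask the network to recover the noise. This is a simplified version on flat vectors; `TinyDenoiser` is a placeholder MLP standing in for the U-Net, and the linear schedule and the crude timestep feature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)


class TinyDenoiser(nn.Module):
    """Placeholder noise predictor; real systems typically use a U-Net."""

    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        t_feat = (t.float() / T).unsqueeze(1)   # crude timestep feature in [0, 1)
        return self.net(torch.cat([x, t_feat], dim=1))


def diffusion_loss(model, x0):
    t = torch.randint(0, T, (x0.size(0),))      # a random step per example
    noise = torch.randn_like(x0)
    a = alpha_bars[t].unsqueeze(1)              # shape (batch, 1) for broadcasting
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * noise
    return F.mse_loss(model(x_t, t), noise)     # predict the noise that was added


model = TinyDenoiser()
loss = diffusion_loss(model, torch.randn(8, 64))
loss.backward()  # gradients are now ready for an optimizer step
```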

ddpm_inference.py
```python
import torch


# Simplified DDPM sampling loop
@torch.no_grad()  # inference only, so skip gradient tracking
def sample(model, timesteps=1000, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)  # start from pure noise
    for t in reversed(range(timesteps)):
        t_tensor = torch.tensor([t])
        predicted_noise = model(x, t_tensor)
        # Remove predicted noise, add a little back for stochasticity.
        # denoise_step is the per-step update; it is not defined in this snippet.
        x = denoise_step(x, predicted_noise, t)
    return x  # final clean image
```
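The `denoise_step` helper that the sampling loop calls is not shown in the post. One plausible version, following the standard DDPM update and assuming a linear noise schedule with a per-step noise scale of sqrt(beta_t), could look like this; treat it as a sketch rather than the author's implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


def denoise_step(x, predicted_noise, t):
    """One reverse step x_t -> x_{t-1}, given the network's noise estimate."""
    beta_t, alpha_t, alpha_bar_t = betas[t], alphas[t], alpha_bars[t]
    # Subtract the scaled noise estimate, then undo this step's signal shrinkage
    mean = (x - beta_t / torch.sqrt(1.0 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)
    if t == 0:
        return mean                         # final step: no fresh noise is added
    return mean + torch.sqrt(beta_t) * torch.randn_like(x)
```

The fresh noise added at every step except the last is the "add a little back for stochasticity" part of the loop.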

Why Diffusion Won

No mode collapse: there is no adversarial game, so training is stable and diverse outputs emerge naturally.
Conditioning is easy: feed a text embedding at each step to guide generation. This is how text-to-image works.
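To make the conditioning idea concrete, here is a toy noise predictor that receives a condition vector alongside the noisy input and the timestep. Concatenating the embedding is a simplification chosen for brevity (large text-to-image models typically inject it through cross-attention), and `ConditionalDenoiser`, the dimensions, and the random "text embedding" are all illustrative.

```python
import torch
import torch.nn as nn


class ConditionalDenoiser(nn.Module):
    """Toy noise predictor that sees the noisy input, the timestep, and a condition vector."""

    def __init__(self, data_dim=64, cond_dim=16, timesteps=1000):
        super().__init__()
        self.timesteps = timesteps
        self.net = nn.Sequential(
            nn.Linear(data_dim + cond_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, data_dim),
        )

    def forward(self, x, t, cond):
        t_feat = (t.float() / self.timesteps).unsqueeze(1)  # crude timestep feature
        return self.net(torch.cat([x, cond, t_feat], dim=1))


model = ConditionalDenoiser()
x = torch.randn(4, 64)
t = torch.randint(0, 1000, (4,))
text_embedding = torch.randn(4, 16)  # placeholder for the output of a text encoder
predicted_noise = model(x, t, text_embedding)
print(predicted_noise.shape)  # torch.Size([4, 64])
```

Because the same embedding is passed at every denoising step, it steers the whole trajectory from noise to image.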

6. Which One Should You Use?

| Architecture | Quality | Speed | Stability | Best for |
|---|---|---|---|---|
| VAE | Medium | Fast | Stable | Latent space research, compression |
| GAN | High | Fast | Tricky | Image synthesis, style transfer |
| Diffusion | SOTA | Slow | Stable | Text-to-image, audio, video |

For production image generation today, Diffusion models are the default. GANs still win on speed (single forward pass vs. 20–1000 diffusion steps). VAEs are the go-to when you need a compact, searchable latent space.

7. Knowledge Check

5 questions

Q1. What does the encoder in a VAE output?
Q2. In a GAN, what does the Discriminator try to do?
Q3. What is "mode collapse" in GAN training?
Q4. In diffusion models, what does the neural network actually learn to predict?
Q5. Which architecture is generally best for text-to-image generation today?

8. What's Next

You now understand all three generative AI families. In the next post, Working with LLMs, we go practical.