Generative AI
Fundamentals
1. What Is Generative AI?
A discriminative model learns to classify: given an image, is it a cat or a dog? A generative model learns the underlying distribution of data — and can sample from it to produce new examples that look like they came from the training set.
The core question generative models answer: "What does data from this distribution look like?" Once a model understands that, it can generate unlimited new examples.
Analogy: A discriminative model is a critic who can tell real art from fake. A generative model is an artist who has studied so much real art that they can paint convincingly from scratch.
Discriminative: Input → Label. Example: "Is this email spam?"
Generative: Noise / Condition → Data. Example: "Generate a photo of a cat."
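To make "learn the distribution, then sample from it" concrete, here is a deliberately tiny sketch. It assumes the data is one-dimensional and well described by a single Gaussian, which is far simpler than any real generative model; the numbers (mean 3.0, spread 0.5) are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Toy "training set": 1-D samples from a distribution we pretend not to know.
data = 3.0 + 0.5 * torch.randn(10_000)

# "Training": estimate the parameters of a Gaussian model of the data.
mu_hat = data.mean()
sigma_hat = data.std()

# "Generation": draw brand-new samples from the fitted distribution.
new_samples = mu_hat + sigma_hat * torch.randn(5)
print(mu_hat.item(), sigma_hat.item(), new_samples)
```

Real generative models replace the two fitted numbers with millions of neural-network weights, but the workflow is the same: fit a distribution, then sample from it.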
2. The Latent Space Idea
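All three generative architectures share a core idea: represent data through a latent variable, and learn to decode points in that latent space back into realistic data. In VAEs and GANs the latent space is much lower-dimensional than the data. VAEs learn an explicit encoder into it, GANs learn only the decoding direction, and classic diffusion models use noise of the same size as the data as their latent.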
Think of it this way: a 256×256 RGB face image has 256×256×3 = 196,608 pixel values. But the meaningful variation (age, expression, lighting) might be captured in just 128 numbers. That compressed representation is the latent vector.
Data → Encoder → Latent vector → Decoder → Reconstructed data
Key property: nearby points in latent space produce similar outputs. This is what lets you interpolate between two faces, or smoothly morph one image into another.
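Interpolation can be sketched in a few lines. The decoder below is a hypothetical, untrained stand-in (so its outputs are meaningless); with a trained decoder, the decoded frames would morph smoothly from one output to the other. The sizes (128-dimensional latent, 784-dimensional output) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim = 128

# Hypothetical stand-in decoder (untrained), only here to make the sketch runnable.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

def interpolate(z_a, z_b, steps=8):
    """Return `steps` latent vectors evenly spaced on the line from z_a to z_b."""
    weights = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # shape (steps, 1)
    return (1.0 - weights) * z_a + weights * z_b            # shape (steps, latent_dim)

z_a, z_b = torch.randn(latent_dim), torch.randn(latent_dim)
path = interpolate(z_a, z_b)
with torch.no_grad():
    frames = decoder(path)  # one decoded output per point on the path
```

Linear interpolation is the simplest choice; some practitioners prefer spherical interpolation for Gaussian latents, which is not shown here.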
3. Variational Autoencoders (VAEs)
A VAE is the simplest of the three architectures covered here. It has two parts: an encoder that maps data to a latent distribution, and a decoder that maps a sampled latent vector back to data.
The key trick: instead of encoding to a fixed point, the encoder outputs the parameters of a Gaussian: a mean (μ) and a variance (σ², predicted as log σ² in the code below for numerical stability). During training we sample the latent vector from that Gaussian. Together with the KL term in the loss, this pushes the latent space to be continuous and well-organized, so that points sampled from the prior decode to plausible outputs.
vae.py

```python
import torch
import torch.nn as nn


class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: data → mu and log_var
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder: z → reconstructed data
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        # Sample z = mu + eps * sigma (keeps gradients flowing)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```
The VAE Loss
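VAE training minimizes two terms simultaneously: a reconstruction loss, which measures how closely the decoder's output matches the input, and a KL divergence, which keeps the encoder's Gaussian close to a standard normal prior.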
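A minimal sketch of that two-term loss, assuming inputs scaled to [0, 1] (so binary cross-entropy pairs naturally with a sigmoid decoder output) and a standard normal prior. The function name `vae_loss` and the `beta` weight are illustrative assumptions; `beta=1.0` corresponds to the standard VAE objective.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # KL term: closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over batch and dims.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl, recon, kl

# Usage with random placeholder tensors (batch of 4, 784 pixels, 32 latent dims).
x = torch.rand(4, 784)
recon_x = torch.rand(4, 784).clamp(0.01, 0.99)
mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)
total, recon, kl = vae_loss(recon_x, x, mu, logvar)
```

When the encoder already outputs μ = 0 and log σ² = 0, the KL term vanishes, which is a quick sanity check on the formula.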
VAE weakness: generated images tend to be blurry. The reconstruction loss rewards the decoder for predicting an average over many plausible outputs, so fine details get smeared out. GANs and diffusion models typically produce sharper results.
4. Generative Adversarial Networks (GANs)
Ian Goodfellow's 2014 insight: train two networks against each other. The Generator creates fake data. The Discriminator tries to distinguish real from fake. Each improves by competing with the other.
Training Dynamics
The Discriminator (D) is trained to output 1 for real images and 0 for fakes. The Generator (G) is trained to fool D, that is, to make it output 1 for fake images. At equilibrium, G produces images indistinguishable from real ones and D can do no better than guess 50/50 (the theoretical ideal, rarely achieved cleanly in practice).
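One alternating update can be sketched as follows. The tiny fully connected networks, the dimensions, and the Adam settings are all assumptions for illustration, and the "real" batch is random placeholder data. Note that D here outputs a raw logit; `BCEWithLogitsLoss` applies the sigmoid internally, so a target of 1 still means "real".

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim = 16, 64

# Hypothetical tiny networks; real GANs typically use convolutional architectures.
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(batch, latent_dim)).detach()  # detach: G is not updated here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) Generator: push D(G(z)) toward 1, i.e. try to fool D.
    fake = G(torch.randn(batch, latent_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

real_batch = torch.randn(32, data_dim)  # placeholder for a batch of real data
loss_d, loss_g = train_step(real_batch)
```

The generator loss used here is the common "non-saturating" variant, which labels fakes as real when updating G, rather than the original minimax form.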
Mode collapse: one of the most common GAN failures. G learns to produce a narrow set of outputs that reliably fool D, instead of covering the full diversity of the data. Techniques like minibatch discrimination and the Wasserstein loss help.
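For reference, the Wasserstein loss itself is short. In this sketch the discriminator is treated as a "critic" that outputs an unbounded score instead of a real/fake probability. The function names are illustrative, and the Lipschitz constraint that WGAN training also requires (weight clipping or a gradient penalty) is deliberately left out.

```python
import torch

def critic_loss(real_scores, fake_scores):
    # The critic wants high scores on real data and low scores on fakes.
    return fake_scores.mean() - real_scores.mean()

def generator_loss(fake_scores):
    # The generator wants the critic to score its samples highly.
    return -fake_scores.mean()

real_scores = torch.tensor([1.0, 3.0])  # made-up critic outputs
fake_scores = torch.tensor([0.0, 1.0])
c_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
```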
Notable GAN variants
5. Diffusion Models
Diffusion models, which rose to prominence from 2020 onward, are now the dominant approach to image generation, powering systems such as Stable Diffusion and DALL-E 3, and reportedly Midjourney. The idea: learn to reverse a noise-addition process.
Forward Process (Noising)
Gradually add Gaussian noise to an image over T steps until it becomes pure noise. This is fixed — no learning required.
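Because every step adds Gaussian noise, the noised image at any step t can be sampled directly from the clean image in closed form, without looping over the earlier steps. The sketch below assumes the linear noise schedule from the original DDPM paper (β from 1e-4 to 0.02 over T = 1000 steps); the helper name `q_sample` and the random "images" are illustrative.

```python
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # per-step noise variances (assumed schedule)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # fraction of signal variance left at step t

def q_sample(x0, t, noise=None):
    """Sample x_t given x_0 in one shot: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)      # broadcast over (C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(2, 3, 64, 64) * 2 - 1               # placeholder "images" scaled to [-1, 1]
t = torch.tensor([0, T - 1])                        # one barely-noised, one fully-noised
xt = q_sample(x0, t)
```

At t = 0 almost all of the signal survives; at t = T - 1 almost none does, so the sample is close to pure Gaussian noise.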
Reverse Process (Denoising — what we train)
A neural network (typically a U-Net) learns to predict the noise added at each step. At inference, start from pure noise and iteratively denoise to get a clean image.
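ddpm_inference.py

```python
import torch

# Simplified DDPM sampling loop.
# `denoise_step` is assumed to be defined elsewhere: it performs the per-step
# update that removes the predicted noise and re-injects a little fresh noise.
@torch.no_grad()
def sample(model, timesteps=1000, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)  # start from pure noise
    for t in reversed(range(timesteps)):
        t_tensor = torch.tensor([t])
        predicted_noise = model(x, t_tensor)
        # Remove predicted noise, add a little back for stochasticity
        x = denoise_step(x, predicted_noise, t)
    return x  # final clean image
```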
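One possible `denoise_step` is the standard DDPM ancestral update, sketched below. It assumes the linear schedule from the DDPM paper and the simple variance choice σ_t² = β_t; the post does not say which variant it has in mind, so treat this as one reasonable instance rather than the author's implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def denoise_step(x, predicted_noise, t):
    """One DDPM reverse step x_t -> x_{t-1}, for an integer timestep t."""
    beta_t, alpha_t, a_bar_t = betas[t], alphas[t], alphas_cumprod[t]
    # Posterior mean: subtract the scaled noise estimate, then rescale.
    mean = (x - beta_t / (1.0 - a_bar_t).sqrt() * predicted_noise) / alpha_t.sqrt()
    if t == 0:
        return mean                            # no fresh noise on the final step
    return mean + beta_t.sqrt() * torch.randn_like(x)

# Usage with a zero tensor standing in for a trained network's noise prediction.
x = torch.randn(1, 3, 8, 8)
x_prev = denoise_step(x, torch.zeros_like(x), t=500)
```

Faster samplers reduce the number of steps by changing this update rule, which is where the lower step counts seen in practice come from.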
Why Diffusion Won
6. Which One Should You Use?
| Architecture | Quality | Speed | Stability | Best for |
|---|---|---|---|---|
| VAE | Medium | Fast | Stable | Latent space research, compression |
| GAN | High | Fast | Tricky | Image synthesis, style transfer |
| Diffusion | SOTA | Slow | Stable | Text-to-image, audio, video |
For production image generation today, Diffusion models are the default. GANs still win on speed (single forward pass vs. 20–1000 diffusion steps). VAEs are the go-to when you need a compact, searchable latent space.
7. What's Next
You now understand all three generative AI families. In the next post we go practical: