01 / 06 · 2026-06-30 · 15 min read · Training · Loss Functions · Gradient Descent · Beginner
Before you can build generative AI, you need to understand how any AI model
learns. This post covers the training loop, loss functions, gradient descent, and the four
main model architectures — the engine under every LLM, image model, and classifier you've
ever used. No PhD required.
1. The Training Loop
Every AI model — whether it's a spam classifier or GPT-4 — learns through the same
fundamental cycle: the training loop. It's deceptively simple. Run it
millions of times and intelligence emerges.
Step 1. Forward Pass: Feed input data through the model. The model makes a prediction using its current weights.
Step 2. Calculate Loss: Compare the prediction to the ground truth. Compute how wrong the model was; this is the “loss”.
Step 3. Backpropagation: Work backwards through the model to calculate how much each weight contributed to the error.
Step 4. Update Weights: Nudge each weight slightly in the direction that reduces the error. Repeat.
↻ Repeat for millions of examples
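To make the four steps concrete, here is a minimal sketch in plain Python. It assumes a toy model with a single weight (prediction = w * x) and made-up data that follows y = 2 * x; the variable names and the learning rate are illustrative choices, not part of any library.

```python
# Toy data generated by the rule y = 2 * x, so the model should learn w close to 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # the model's single weight, starting from a bad guess
learning_rate = 0.05  # illustrative value

for epoch in range(200):
    for x, y_true in data:
        y_pred = w * x                       # 1. forward pass
        loss = (y_pred - y_true) ** 2        # 2. calculate loss (squared error)
        grad_w = 2 * (y_pred - y_true) * x   # 3. backprop: d(loss)/d(w)
        w -= learning_rate * grad_w          # 4. update weights

print(round(w, 3))
```

Nothing in the loop "knows" the rule is y = 2 * x. The weight ends up near 2 only because each update makes the loss a little smaller.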
A helpful analogy: think of how a child learns to throw a dart. They try
(forward pass), they see how far from the bullseye they landed (loss), their brain figures
out what went wrong — too much wrist, not enough force (backprop) — and they adjust for
the next throw (update weights).
🎯 Try: forward pass. The model makes a prediction with its current knowledge.
📏 Get Feedback: loss calculation. Measure exactly how wrong the prediction was.
🔧 Adjust: gradient descent. Update the weights to be slightly less wrong next time.
💡
Key insight: the model doesn't understand anything. It's adjusting numbers (weights) based on math. After enough adjustments, those numbers happen to encode useful patterns. That's all “intelligence” is in these systems.
2. Datasets: The Fuel
The training loop can only run if you have data to feed it. Datasets are the fuel — and
the quality of your dataset determines the ceiling of your model's performance.
Garbage in, garbage out is the most important rule in ML.
⚡
Quality vs Quantity: 10,000 clean, well-labeled examples often beat 1,000,000 noisy ones. Models trained on bad data confidently learn the wrong things.
Labeled vs. Unlabeled Data
Labeled data has answers attached — "this image is a cat", "this review
is positive". Training on labeled data is called supervised learning. It's powerful
but expensive because humans must do the labeling.
Unlabeled data has no pre-attached answers. LLMs like GPT are trained
this way: the "label" for each token is the next token in the text, so the text itself
supplies the labels with no human labeling. This is called self-supervised learning.
Train / Validation / Test Splits
Split | Share | Purpose
Training | 70% | The model learns from this data; weights are updated based on it.
Validation | 15% | Used during training to tune hyperparameters and detect overfitting.
Test | 15% | Touched once, at the very end, to measure real-world performance.
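A 70/15/15 split like the one in the table can be sketched in a few lines of plain Python. The function name split_dataset and the fixed seed are assumptions made for this example; real projects often use a library helper instead.

```python
import random

def split_dataset(examples, train=0.70, val=0.15, seed=0):
    """Shuffle a copy of the examples, then cut it into train/val/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],         # whatever remains is the test set
    )

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting matters: if the data is sorted (by date or by class, say), an unshuffled split would give the model a test set that looks nothing like its training set.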
Input (X): sequence of token IDs, [1045, 2293, 9435]
Label (Y): next token ID, [2937] (the word “dogs”)
Example: “I love” → “dogs”
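Here is a rough sketch of how next-token training pairs fall out of raw text once it has been turned into token IDs. The helper next_token_pairs and the final ID in the list are made up for illustration; real tokenizers and data pipelines are more involved.

```python
def next_token_pairs(token_ids, context_size):
    """Slide a window over the sequence; each window's label is the token after it."""
    pairs = []
    for i in range(len(token_ids) - context_size):
        context = token_ids[i:i + context_size]
        label = token_ids[i + context_size]
        pairs.append((context, label))
    return pairs

tokens = [1045, 2293, 9435, 2937, 1012]  # the last ID is invented for this example
for context, label in next_token_pairs(tokens, context_size=3):
    print(context, "->", label)
```

Every position in the text yields one training example, which is why no human labeling is needed.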
3. Loss Functions
The loss function is the model's error signal — it measures the gap
between what the model predicted and the correct answer. Lower loss = better model.
Training is the process of minimizing this number.
Think of it like darts: the loss is the distance between where your dart landed and the
bullseye. The training loop is asking: “how do I throw differently to land closer?”
Mean Squared Error (MSE)
Regression
Loss = mean((predicted − actual)²)
Used when predicting a continuous number — like a house price or temperature. Squaring the error penalizes big mistakes more than small ones.
Example: predicting stock price → model guesses $150, actual is $120 → loss = (150−120)² = 900
Cross-Entropy Loss
Classification
Loss = −log(predicted probability of correct class)
Used when classifying into categories — spam/not-spam, dog/cat. Measures how confident the model was about the correct answer.
Example: model says 90% chance of cat (correct) → low loss. 10% chance → high loss.
Contrastive Loss
Similarity / Embeddings
Loss = pull similar pairs together, push different pairs apart
Used for learning similarity — used in embedding models and CLIP. Trains the model to cluster similar things close in vector space.
Example: “dog photo” and “a photo of a dog” should have embeddings close together.
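The three losses above can be sketched in plain Python. MSE and cross-entropy follow the formulas given. The contrastive version is one simple cosine, margin-based form chosen for illustration (the margin of 0.5 is an assumption); CLIP itself uses a batch-wise formulation rather than this pairwise one.

```python
import math

def mse(predicted, actual):
    """Mean of the squared differences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(prob_of_correct_class):
    """Negative log of the probability assigned to the right answer."""
    return -math.log(prob_of_correct_class)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(a, b, is_similar, margin=0.5):
    """Pull similar pairs together; push different pairs below the margin."""
    sim = cosine_similarity(a, b)
    if is_similar:
        return 1.0 - sim
    return max(0.0, sim - margin)

print(mse([150.0], [120.0]))                     # the stock price example
print(cross_entropy(0.9), cross_entropy(0.1))    # confident vs. unconfident
print(contrastive_loss([1.0, 0.0], [0.9, 0.1], is_similar=True))
```

Note how cross-entropy grows quickly as the probability of the correct class drops: being confidently wrong is punished far more than being mildly unsure.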
💡
Why does the choice matter? The wrong loss function gives the model the wrong optimization target. If you use MSE for a classification task, the model might minimize the number but still classify things wrong. Loss function design is a key engineering decision.
4. Gradient Descent & Backpropagation
You have a loss. Now what? You need to update the model's weights to reduce it.
That's what gradient descent does — it's the optimization algorithm
that actually "trains" the model.
The Mountain Analogy
Imagine you're blindfolded on a hilly mountain, and your goal is to reach the valley
(minimum loss). You can't see the whole landscape — but you can feel the slope under
your feet. Gradient descent says: take a step in the downhill direction. Repeat
until you're in a valley.
⛰️ You're on the mountain: High loss = high altitude. The goal is the valley (low loss).
📐 Feel the slope: The gradient points uphill, so its negative tells you which direction is downhill from your current position.
👣 Take a step: Update weights by a small amount in the downhill direction. The learning rate controls step size.
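The mountain walk can be sketched on a one-dimensional "landscape". The loss function below, (w - 3)^2, is an assumption chosen because its valley is easy to see (at w = 3); a real model has one such dimension per weight.

```python
def loss(w):
    return (w - 3.0) ** 2      # bowl-shaped landscape with its valley at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)     # slope of the loss at w (positive means uphill to the right)

def gradient_descent(w, learning_rate, steps):
    for _ in range(steps):
        w -= learning_rate * gradient(w)   # step against the gradient, i.e. downhill
    return w

w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final, loss(w_final))
```

The code never looks at the whole landscape. It only asks for the slope at the current position, exactly like the blindfolded hiker.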
The Learning Rate
💥 Too High: You overshoot the valley and bounce around, or diverge entirely. Loss gets worse, not better.
✅ Just Right: Steady progress toward the minimum. Loss decreases smoothly over training epochs.
🐌 Too Low: Glacially slow progress. Training can take orders of magnitude longer than necessary, or stall before reaching a good minimum.
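The three regimes are easy to reproduce on a toy problem. This sketch assumes the loss (w - 3)^2 starting from w = 0 (initial loss 9); the specific learning rates are picked to show each behaviour on this function and would not transfer to a real model.

```python
def run(learning_rate, steps=20, w=0.0):
    """Gradient descent on loss(w) = (w - 3)^2; returns the final loss."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return (w - 3.0) ** 2

for name, lr in [("too high", 1.1), ("just right", 0.3), ("too low", 0.001)]:
    print(f"{name:>10} (lr={lr}): final loss = {run(lr):.4g}")
```

With the high rate every step overshoots by more than the last, so the loss ends up far above where it started. With the low rate the loss has barely moved after 20 steps.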
Backpropagation
Before gradient descent can update the weights, it needs to know the gradient for
each weight in the network. With billions of parameters, computing this naively
would be impossibly slow. Backpropagation solves this with the chain rule
of calculus — it propagates the error signal backwards through the network layer by layer,
computing each weight's gradient efficiently.
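As a hand-worked sketch of the chain rule, here is backprop on a tiny two-weight network, checked against a slow "nudge each weight and see" estimate. The network shape (one tanh hidden unit, squared-error loss) and all the numbers are assumptions for illustration.

```python
import math

def forward(w1, w2, x, target):
    h = math.tanh(w1 * x)        # hidden layer
    y = w2 * h                   # output layer
    loss = (y - target) ** 2
    return h, y, loss

def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h                    # because y = w2 * h
    dloss_dh = dloss_dy * w2                    # pass the error back one layer
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x    # d tanh(u)/du = 1 - tanh(u)^2
    return dloss_dw1, dloss_dw2

def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Slow reference: nudge each weight and re-run the forward pass."""
    g1 = (forward(w1 + eps, w2, x, target)[2] - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, target)[2] - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)
    return g1, g2

print(backward(0.5, -1.2, x=2.0, target=1.0))
print(numerical_gradient(0.5, -1.2, x=2.0, target=1.0))
```

The numerical version needs two extra forward passes per weight, which is hopeless with billions of weights. Backprop gets all the gradients from one forward and one backward pass by reusing the intermediate values.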
You don't need to implement backprop yourself. PyTorch and TensorFlow compute gradients
automatically (in PyTorch via loss.backward(), in TensorFlow via tf.GradientTape). But
understanding it conceptually helps you debug training problems.
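import torch

model = torch.nn.Linear(10, 2)  # placeholder model, for illustration only

# Adam adapts the effective learning rate per parameter
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
)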
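To see what "adapts per parameter" means, here is a sketch of a single Adam update for one scalar parameter, written in plain Python from the standard Adam update rule. The function name adam_step is made up for this example, and the hyperparameters are commonly used defaults.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Two parameters whose gradients differ by four orders of magnitude.
w_big, _, _ = adam_step(w=1.0, grad=100.0, m=0.0, v=0.0, t=1)
w_small, _, _ = adam_step(w=1.0, grad=0.01, m=0.0, v=0.0, t=1)
print(1.0 - w_big, 1.0 - w_small)   # size of each update
```

Both parameters move by roughly the same amount, close to lr, even though their raw gradients are wildly different. Plain gradient descent would have moved the first one 10,000 times further.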
5. Model Architectures
The training loop is the same for all models. What changes is the architecture
— the structure of the model itself. Different architectures are suited to different types
of data and tasks.
📊 Linear / Logistic Regression (Simplest)
The baseline. A single layer mapping inputs directly to outputs. Useful for simple problems and as a starting point.
Use cases: tabular data, baselines
🖼️ CNN (Images)
Convolutional filters slide across the input detecting local patterns (edges, textures, shapes). Efficient because the same filter reuses its weights across positions.
Use cases: image classification, object detection
⏩ RNN / LSTM (Sequences)
Processes data step-by-step, maintaining a hidden state. LSTMs add gates to control memory. Largely superseded by Transformers.
Use cases: time series, audio, legacy NLP
✨ Transformer (State of the Art)
Self-attention relates every token to every other token in parallel. The architecture behind nearly every modern LLM. Scales well as data, parameters, and compute grow.
Use cases: text, code, multimodal; basically everything now
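Self-attention is the least intuitive idea in the list above, so here is a stripped-down sketch in plain Python. It assumes queries, keys and values are all just the token vectors themselves; real Transformers first apply learned projections, use multiple heads, and compute everything as batched matrix operations rather than Python loops. The embeddings are made up.

```python
import math

def softmax(scores):
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score this token against every token in the sequence (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)               # attention weights, summing to 1
        # The output is a weighted average of all token vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        outputs.append(mixed)
    return outputs

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three made-up 2-d token vectors
for row in self_attention(embeddings):
    print([round(x, 3) for x in row])
```

Each output row is a blend of every input row, weighted by similarity. No token depends on the result for another token, which is what lets real implementations process all positions in parallel.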
⚡
Architecture choice in practice: for most tasks today, start with a pre-trained Transformer (BERT for understanding, GPT for generation). Training CNNs and RNNs from scratch is rare — you fine-tune existing architectures.
6. Putting It Together: Training Your First Model
Here's a complete training pipeline in annotated pseudocode. This structure works for
everything from a simple classifier to fine-tuning a language model.
Python — complete training pipeline
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader

# load_and_split_dataset, TransformerClassifier, and evaluate are placeholders
# for your own code; they are not PyTorch functions.

# ── Step 1: Load and prepare your data ──
train_data, val_data, test_data = load_and_split_dataset(
    path="data/", splits=[0.70, 0.15, 0.15]
)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32)
test_loader = DataLoader(test_data, batch_size=32)

# ── Step 2: Define your model ──
model = TransformerClassifier(
    vocab_size=50000, hidden_dim=256, num_layers=4, num_classes=2
)

# ── Step 3: Define loss and optimizer ──
loss_fn = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-4)

# ── Step 4: The training loop ──
num_epochs = 3  # example value
for epoch in range(num_epochs):
    model.train()
    for batch_inputs, batch_labels in train_loader:
        predictions = model(batch_inputs)           # forward pass
        loss = loss_fn(predictions, batch_labels)   # compute loss
        optimizer.zero_grad()                       # clear gradients
        loss.backward()                             # backprop
        optimizer.step()                            # update weights

    # ── Step 5: Validate (once per epoch) ──
    val_loss, val_acc = evaluate(model, val_loader)
    print(f"Epoch {epoch}: val_loss={val_loss:.4f}, acc={val_acc:.2%}")

# ── Step 6: Final test evaluation ──
_, test_acc = evaluate(model, test_loader)
print(f"Test accuracy: {test_acc:.2%}")
⚡
What changes for LLMs? Scale. The structure above applies to fine-tuning GPT or Llama — you're still running this loop. The difference is the model has billions of parameters, batches are token sequences, and loss is cross-entropy over the vocabulary at each position.
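"Cross-entropy over the vocabulary at each position" can be sketched as follows. The model's raw scores (logits) and the 4-token vocabulary are made up; the helper names are assumptions for this example. Real frameworks compute the same quantity on large tensors.

```python
import math

def token_cross_entropy(logits, target_id):
    """-log softmax(logits)[target_id], computed stably via log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_id]

def sequence_loss(logits_per_position, target_ids):
    """Average the per-position losses over the sequence."""
    losses = [token_cross_entropy(l, t) for l, t in zip(logits_per_position, target_ids)]
    return sum(losses) / len(losses)

# Made-up scores over a 4-token vocabulary at 2 positions.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
targets = [0, 2]   # the correct next-token ID at each position
print(sequence_loss(logits, targets))
```

A model that has learned nothing and scores every token equally gets a loss of log(vocabulary size) per position, which is a handy sanity check at the start of training.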
7. Knowledge Check
Five questions to test your understanding.
Question 1 of 5
What does "loss" measure in a machine learning model?
Answer: Loss measures the error: the difference between what the model predicted and the ground truth. It is the error signal that drives all of training, and the entire goal of training is to minimize this number.
Question 2 of 5
In gradient descent, what does the "learning rate" control?
Answer: The learning rate controls the size of each weight update step. Think of it as stride length when walking downhill: too big and you overshoot the valley, too small and training crawls. Finding the right learning rate is one of the key tasks in ML engineering.
Question 3 of 5
What is backpropagation?
Answer: Backpropagation is the algorithm that computes how much each weight in the network contributed to the error. It uses the chain rule of calculus to propagate the error signal backwards through the layers, computing each weight's gradient efficiently.
Question 4 of 5
Which architecture is best suited for processing image data?
Answer: CNNs (Convolutional Neural Networks). Their convolutional filters slide across the image, detecting local spatial patterns like edges and textures, and share weights across positions. This spatial inductive bias makes them highly efficient for image data.
Question 5 of 5
What is the purpose of the validation set during training?
Answer: The validation set is used during training to check how the model generalizes to unseen data, tune hyperparameters like the learning rate, and detect when the model starts overfitting. The test set, which you keep completely separate, is what measures final real-world performance.
8. What's Next
You now understand the core mechanics of how any AI model is trained. The next posts
build on this foundation — going into generative models, LLMs, and building real
applications.
02 · Generative AI Fundamentals (Coming Soon)
VAEs, GANs, Diffusion Models: how machines learn to generate images, text and audio.
03 · Working with LLMs (Coming Soon)
Fine-tuning, prompt engineering, RAG, and embedding search for production LLM apps.
04 · Building a Chatbot (Coming Soon)
End-to-end walkthrough: from API calls to memory, tools, and deployment.
05 · Image Generation Pipeline (Coming Soon)
Stable Diffusion internals, ControlNet, LoRA fine-tuning for custom image models.
06 · Deploying AI to Production (Coming Soon)
Serving models, latency optimization, monitoring, cost control, and scaling.