How AI Models Are Built

01 / 06 · 2026-06-30 · 15 min read · Training · Loss Functions · Gradient Descent · Beginner

Before you can build generative AI, you need to understand how any AI model learns. This post covers the training loop, loss functions, gradient descent, and the four main model architectures — the engine under every LLM, image model, and classifier you've ever used. No PhD required.

In this post

1. The Training Loop
2. Datasets: The Fuel
3. Loss Functions
4. Gradient Descent & Backpropagation
5. Model Architectures
6. Putting It Together: Training Your First Model
7. Knowledge Check
8. What's Next

1. The Training Loop

Every AI model — whether it's a spam classifier or GPT-4 — learns through the same fundamental cycle: the training loop. It's deceptively simple. Run it millions of times and intelligence emerges.

Forward Pass

Feed input data through the model. The model makes a prediction using its current weights.

Calculate Loss

Compare the prediction to the ground truth. Compute how wrong the model was — this is the “loss”.

Backpropagation

Work backwards through the model to calculate how much each weight contributed to the error.

Update Weights

Nudge each weight slightly in the direction that reduces the error. Repeat.

↻ Repeat for millions of examples
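The four steps above can be sketched in a few lines of plain Python. This is a toy, not how real frameworks are written: the one-weight "model", the data (points on the line y = 3x), and the learning rate are all assumptions chosen so the loop is easy to follow.

```python
# Toy setup (assumed for illustration): learn y = 3x with a single weight.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # (input, ground truth)
weight = 0.0           # the model's only parameter, starts out wrong
learning_rate = 0.01   # how far to nudge the weight on each update

for epoch in range(200):
    for x, target in data:
        prediction = weight * x                    # 1. forward pass
        loss = (prediction - target) ** 2          # 2. calculate loss (squared error)
        gradient = 2 * (prediction - target) * x   # 3. backprop: d(loss)/d(weight)
        weight -= learning_rate * gradient         # 4. update weights

print(round(weight, 3))  # ends very close to 3.0
```

Nothing in the loop "knows" the answer is 3. The weight drifts there because every update makes the loss a little smaller.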

A helpful analogy: think of how a child learns to throw a dart. They try (forward pass), they see how far from the bullseye they landed (loss), their brain figures out what went wrong — too much wrist, not enough force (backprop) — and they adjust for the next throw (update weights).

🎯

Try

Forward pass: the model makes a prediction with current knowledge.

→

📏

Get Feedback

Loss calculation: measure exactly how wrong the prediction was.

→

🔧

Adjust

Gradient descent: update weights to be slightly less wrong next time.

💡

Key insight: the model doesn't understand anything. It's adjusting numbers (weights) based on math. After enough adjustments, those numbers happen to encode useful patterns. That's all “intelligence” is in these systems.

2. Datasets: The Fuel

The training loop can only run if you have data to feed it. Datasets are the fuel — and the quality of your dataset determines the ceiling of your model's performance. Garbage in, garbage out is the most important rule in ML.

⚡

Quality vs Quantity: 10,000 clean, well-labeled examples often beat 1,000,000 noisy ones. Models trained on bad data confidently learn the wrong things.

Labeled vs. Unlabeled Data

Labeled data has answers attached — "this image is a cat", "this review is positive". Training on labeled data is called supervised learning. It's powerful but expensive because humans must do the labeling.

Unlabeled data has no pre-attached answers. LLMs like GPT are pre-trained on it using self-supervised learning: the "label" for each token is the next token in the text, so raw web text supplies an enormous number of labels with no human annotation.
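To make "the label is the next token" concrete, here is a minimal sketch that turns one token sequence into training pairs. The function name `next_token_pairs` and the token IDs are made up for illustration; real pipelines do this in batches over fixed-length windows.

```python
def next_token_pairs(token_ids):
    """Return (context, next_token) training pairs from one token sequence."""
    pairs = []
    for i in range(1, len(token_ids)):
        context = token_ids[:i]   # everything the model has seen so far
        label = token_ids[i]      # the token that actually came next
        pairs.append((context, label))
    return pairs

# Hypothetical token IDs standing in for a tokenized sentence.
tokens = [11, 42, 7, 99]
pairs = next_token_pairs(tokens)
for context, label in pairs:
    print(context, "->", label)
```

A four-token sequence yields three training examples, and no human had to label any of them.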

Train / Validation / Test Splits

Split	Share	Purpose
Training	70%	The model learns from this data — weights are updated based on it.
Validation	15%	Used during training to tune hyperparameters and detect overfitting.
Test	15%	Touched once, at the very end — to measure real-world performance.
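A split like the one in the table can be sketched as follows. The helper name `split_dataset` and the fixed seed are assumptions for this example. The important detail is shuffling before slicing, so each split is a random sample of the whole dataset.

```python
import random

def split_dataset(examples, train_share=0.70, val_share=0.15, seed=0):
    """Shuffle once, then slice into train / validation / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_share)
    n_val = round(len(shuffled) * val_share)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is the test set
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Every example lands in exactly one split. If the same example appears in both training and test data, the test score stops telling you anything about real-world performance.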

What a Training Row Looks Like

Image Classification

Input (X): 28×28 pixel array (grayscale values 0–255)

Label (Y): Integer 0–9 (digit class)

Example: [[0,0,128,255…], …] → 7

Text Generation

Input (X): Sequence of token IDs: [1045, 2293, 9435]

Label (Y): Next token ID: [2937] (the word “dogs”)

Example: “I love” → “dogs”

3. Loss Functions

The loss function is the model's error signal — it measures the gap between what the model predicted and the correct answer. Lower loss = better model. Training is the process of minimizing this number.

Think of it like darts: the loss is the distance between where your dart landed and the bullseye. The training loop is asking: “how do I throw differently to land closer?”

Mean Squared Error (MSE)

Regression

Loss = mean((predicted − actual)²)

Used when predicting a continuous number — like a house price or temperature. Squaring the error penalizes big mistakes more than small ones.

Example: predicting stock price → model guesses $150, actual is $120 → loss = (150−120)² = 900
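Here is the MSE formula as a small function, applied first to the single prediction from the example and then to a batch. The batch values are invented for illustration.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and targets."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)

# The single example from the text: guess $150, actual $120.
single = mean_squared_error([150.0], [120.0])

# A made-up batch: one big miss and two small ones.
batch = mean_squared_error([150.0, 118.0, 121.0], [120.0, 120.0, 120.0])

print(single)           # 900.0
print(round(batch, 2))  # 301.67
```

In the batch, the one $30 miss contributes 900 of the 905 total squared error. That is what "penalizes big mistakes more" looks like in numbers.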

Cross-Entropy Loss

Classification

Loss = −log(predicted probability of correct class)

Used when classifying into categories — spam/not-spam, dog/cat. Measures how confident the model was about the correct answer.

Example: model says 90% chance of cat (correct) → low loss. 10% chance → high loss.
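The two cases in the example can be checked directly. This sketch assumes the model's output is already a set of probabilities and uses the natural logarithm. The function name and the dictionary format are choices made for this example.

```python
import math

def cross_entropy(probabilities, correct_class):
    """Negative log of the probability the model gave to the correct class."""
    return -math.log(probabilities[correct_class])

confident = cross_entropy({"cat": 0.9, "dog": 0.1}, "cat")
wrong = cross_entropy({"cat": 0.1, "dog": 0.9}, "cat")

print(round(confident, 3))  # 0.105
print(round(wrong, 3))      # 2.303
```

The loss grows without bound as the probability of the correct class approaches zero, so confident wrong answers are punished hardest.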

Contrastive Loss

Similarity / Embeddings

Loss = pull similar pairs together, push different pairs apart

Used to learn similarity, for example in embedding models and CLIP. Trains the model to place similar items close together in vector space and dissimilar items far apart.

Example: “dog photo” and “a photo of a dog” should have embeddings close together.
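"Pull together, push apart" can be written down in several ways. The sketch below uses one classic margin-based pairwise form. It is an illustrative choice, not the exact loss used by CLIP, which compares every item in a batch against every other item. The two-dimensional embeddings and the margin of 1.0 are made up.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, is_similar, margin=1.0):
    """Margin-based pairwise contrastive loss for two embedding vectors."""
    distance = euclidean_distance(a, b)
    if is_similar:
        return distance ** 2                  # pull similar pairs together
    return max(0.0, margin - distance) ** 2   # push different pairs apart, up to the margin

# Hypothetical embeddings for illustration.
dog_photo = [0.9, 0.1]
dog_caption = [0.8, 0.2]
car_caption = [0.1, 0.9]

matched = contrastive_loss(dog_photo, dog_caption, is_similar=True)
far_apart = contrastive_loss(dog_photo, car_caption, is_similar=False)
too_close = contrastive_loss(dog_photo, dog_caption, is_similar=False)

print(round(matched, 3), far_apart, round(too_close, 3))
```

A matched pair is penalized for any distance at all, while a mismatched pair is only penalized if it sits closer than the margin.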

💡

Why does the choice matter? The wrong loss function gives the model the wrong optimization target. If you use MSE for a classification task, the model might minimize the number but still classify things wrong. Loss function design is a key engineering decision.

4. Gradient Descent & Backpropagation

You have a loss. Now what? You need to update the model's weights to reduce it. That's what gradient descent does — it's the optimization algorithm that actually "trains" the model.

The Mountain Analogy

Imagine you're blindfolded on a hilly mountain, and your goal is to reach the valley (minimum loss). You can't see the whole landscape — but you can feel the slope under your feet. Gradient descent says: take a step in the downhill direction. Repeat until you're in a valley.

⛰️

You're on the mountain

High loss = high altitude. The goal is the valley (low loss).

→

📐

Feel the slope

The gradient tells you which direction is downhill from your current position.

→

👣

Take a step

Update weights by a small amount in the downhill direction. The learning rate controls step size.
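Written as a formula, one step of the walk downhill looks like this. The symbols are notation chosen for this post: w_t is the weights at step t, η (eta) is the learning rate, and ∇L is the gradient of the loss with respect to the weights.

```latex
w_{t+1} = w_t - \eta \, \nabla_w L(w_t)
```

The minus sign is the "downhill" part: the gradient points in the direction of steepest increase, so you step the opposite way.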

The Learning Rate

Too High

💥

You overshoot the valley and bounce around — or diverge entirely. Loss gets worse, not better.

Just Right

✅

Steady progress toward the minimum. Loss decreases smoothly over training epochs.

Too Low

🐌

Glacially slow progress. Training takes far longer than necessary and may stall before reaching a good minimum.
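You can see all three regimes on the simplest possible loss surface, loss(w) = w², whose minimum is at w = 0. The function name and the three learning rates below are picked purely for illustration. Useful values depend heavily on the model and data.

```python
def run_gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize loss(w) = w**2 from `start` and return the final loss."""
    w = start
    for _ in range(steps):
        gradient = 2 * w               # derivative of w**2
        w -= learning_rate * gradient  # one gradient descent step
    return w ** 2

too_high = run_gradient_descent(1.1)       # every step overshoots and lands further away
just_right = run_gradient_descent(0.1)     # shrinks w by 20% per step
too_low = run_gradient_descent(0.0001)     # barely moves in 50 steps

print(f"too high:   {too_high:.3g}")
print(f"just right: {just_right:.3g}")
print(f"too low:    {too_low:.3g}")
```

All three runs start from the same loss of 1.0. Only the middle one ends near zero: the high rate ends far above where it started, and the low rate has barely moved.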

Backpropagation

Before gradient descent can update the weights, it needs to know the gradient for each weight in the network. With billions of parameters, computing this naively would be impossibly slow. Backpropagation solves this with the chain rule of calculus — it propagates the error signal backwards through the network layer by layer, computing each weight's gradient efficiently.

You don't need to implement backprop yourself. Frameworks do it automatically: PyTorch via loss.backward() and TensorFlow via tf.GradientTape. But understanding it conceptually helps you debug training problems.
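To see the chain rule at work, here is backprop done by hand on a two-weight network, checked against a brute-force numerical estimate. The network shape (one tanh hidden unit, squared-error loss) and all the numbers are assumptions chosen to keep the example tiny.

```python
import math

def forward(w1, w2, x):
    h = w1 * x          # hidden pre-activation
    a = math.tanh(h)    # hidden activation
    y = w2 * a          # output
    return a, y

def loss_fn(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def backprop(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    a, y = forward(w1, w2, x)
    d_y = 2 * (y - target)      # d(loss)/d(y)
    d_w2 = d_y * a              # y = w2 * a
    d_a = d_y * w2              # pass the error back through the output layer
    d_h = d_a * (1 - a ** 2)    # derivative of tanh
    d_w1 = d_h * x              # h = w1 * x
    return d_w1, d_w2

def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Slow sanity check: nudge each weight and measure the change in loss."""
    d_w1 = (loss_fn(w1 + eps, w2, x, target) - loss_fn(w1 - eps, w2, x, target)) / (2 * eps)
    d_w2 = (loss_fn(w1, w2 + eps, x, target) - loss_fn(w1, w2 - eps, x, target)) / (2 * eps)
    return d_w1, d_w2

analytic = backprop(0.5, -1.5, 2.0, 1.0)
numeric = numerical_gradient(0.5, -1.5, 2.0, 1.0)
print(analytic)
print(numeric)
```

The numerical check needs two extra forward passes per weight, which is hopeless at billions of parameters. Backprop gets every gradient from one forward and one backward pass.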

SGD vs. Adam

SGD: classic, sensitive to lr

import torch

# `model` is assumed to be any torch.nn.Module you have already defined.
optimizer = torch.optim.SGD(
  model.parameters(),
  lr=0.01,
  momentum=0.9
)

Adam: adaptive, default choice

optimizer = torch.optim.Adam(
  model.parameters(),
  lr=1e-3
)
# Adapts lr per-parameter
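What "adapts lr per-parameter" means is easiest to see on a single update. The sketch below hand-writes one step of plain SGD (without momentum, for brevity) and one step of the standard Adam update rule. The function names and the two test gradients are made up for illustration.

```python
import math

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

results = []
for grad in (0.001, 1000.0):                  # a tiny gradient and a huge one
    sgd_move = abs(sgd_step(0.0, grad))
    adam_w, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1)
    results.append((grad, sgd_move, abs(adam_w)))
    print(f"grad={grad}: SGD moves {sgd_move:.3g}, Adam moves {abs(adam_w):.3g}")
```

SGD's step size scales directly with the gradient, so the two steps differ by a factor of a million. Adam divides by the gradient's recent magnitude, so both first steps come out at roughly the learning rate. That is a large part of why it tends to need less tuning.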

5. Model Architectures

The training loop is the same for all models. What changes is the architecture — the structure of the model itself. Different architectures are suited to different types of data and tasks.

📊

Linear / Logistic Regression (Simplest)

The baseline. A single layer mapping inputs directly to outputs. Useful for simple problems and as a starting point.

Tabular data, baselines

🖼️

CNN (Images)

Convolutional filters slide across input detecting local patterns (edges, textures, shapes). Efficient because the same filter reuses weights across positions.

Image classification, object detection

⏩

RNN / LSTM (Sequences)

Processes data step-by-step, maintaining a hidden state. LSTMs add gates to control memory. Largely superseded by Transformers.

Time series, audio, legacy NLP

✨

Transformer (State of the Art)

Self-attention relates every token to every other token in parallel. It is the architecture behind nearly every modern LLM, and it scales well with more data and compute.

Text, code, multimodal — basically everything now
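Since self-attention is the piece that sets Transformers apart, here is a stripped-down sketch of scaled dot-product attention in plain Python. It makes a big simplifying assumption: the raw token vectors are used directly as queries, keys, and values. Real Transformers first pass them through learned projection matrices and run several attention heads side by side. The token vectors are made up.

```python
import math

def softmax(scores):
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Each output is a weighted mix of all value vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:                           # each query is independent of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)               # how much this token attends to each token
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and says how much one token draws from every token in the sequence, itself included. Because no query depends on another query's result, the loop can be replaced by a single matrix multiplication. That is where the "in parallel" comes from.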

⚡

Architecture choice in practice: for most text tasks today, start with a pre-trained Transformer (BERT-style models for understanding, GPT-style models for generation). Training CNNs and RNNs from scratch is rare; more often you fine-tune an existing pre-trained model.

6. Putting It Together: Training Your First Model

Here's a complete training pipeline in annotated pseudocode. This structure works for everything from a simple classifier to fine-tuning a language model.

        
          
Python: complete training pipeline

# NOTE: load_and_split_dataset, TransformerClassifier, and evaluate are
# placeholders for code you supply; they are not library functions.
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader

# ── Step 1: Load and prepare your data ──
train_data, val_data, test_data = load_and_split_dataset(
    path="data/", splits=[0.70, 0.15, 0.15]
)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32)    # no shuffling needed for evaluation
test_loader = DataLoader(test_data, batch_size=32)

# ── Step 2: Define your model ──
model = TransformerClassifier(
    vocab_size=50000, hidden_dim=256, num_layers=4, num_classes=2
)

# ── Step 3: Define loss and optimizer ──
loss_fn = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-4)
num_epochs = 10  # example value; tune for your task

# ── Step 4: The training loop ──
for epoch in range(num_epochs):
    model.train()
    for batch_inputs, batch_labels in train_loader:
        predictions = model(batch_inputs)          # forward pass
        loss = loss_fn(predictions, batch_labels)  # compute loss
        optimizer.zero_grad()                      # clear gradients from the previous batch
        loss.backward()                            # backprop
        optimizer.step()                           # update weights

    # ── Step 5: Validate ──
    # evaluate() is assumed to return (average loss, accuracy)
    val_loss, val_acc = evaluate(model, val_loader)
    print(f"Epoch {epoch}: val_loss={val_loss:.4f}, acc={val_acc:.2%}")

# ── Step 6: Final test evaluation ──
test_loss, test_acc = evaluate(model, test_loader)
print(f"Test accuracy: {test_acc:.2%}")

⚡

What changes for LLMs? Scale. The structure above applies to fine-tuning GPT or Llama — you're still running this loop. The difference is the model has billions of parameters, batches are token sequences, and loss is cross-entropy over the vocabulary at each position.

7. Knowledge Check

Five questions to test your understanding.

Question 1 of 5

What does "loss" measure in a machine learning model?

Question 2 of 5

In gradient descent, what does the "learning rate" control?

Question 3 of 5

What is backpropagation?

Question 4 of 5

Which architecture is best suited for processing image data?

Question 5 of 5

What is the purpose of the validation set during training?

8. What's Next

You now understand the core mechanics of how any AI model is trained. The next posts build on this foundation — going into generative models, LLMs, and building real applications.

Generative AI Fundamentals (Coming Soon)

VAEs, GANs, Diffusion Models — how machines learn to generate images, text and audio.

Working with LLMs (Coming Soon)

Fine-tuning, prompt engineering, RAG, and embedding search for production LLM apps.

Building a Chatbot (Coming Soon)

End-to-end walkthrough: from API calls to memory, tools, and deployment.

Image Generation Pipeline (Coming Soon)

Stable Diffusion internals, ControlNet, LoRA fine-tuning for custom image models.

Deploying AI to Production (Coming Soon)

Serving models, latency optimization, monitoring, cost control, and scaling.
