
How AI Models Are Built

Before you can build generative AI, you need to understand how any AI model learns. This post covers the training loop, loss functions, gradient descent, and the four main model architectures — the engine under every LLM, image model, and classifier you've ever used. No PhD required.

1. The Training Loop

Every AI model — whether it's a spam classifier or GPT-4 — learns through the same fundamental cycle: the training loop. It's deceptively simple. Run it millions of times and intelligence emerges.

Step 1: Forward Pass. Feed input data through the model. The model makes a prediction using its current weights.
Step 2: Calculate Loss. Compare the prediction to the ground truth. Compute how wrong the model was; this is the “loss”.
Step 3: Backpropagation. Work backwards through the model to calculate how much each weight contributed to the error.
Step 4: Update Weights. Nudge each weight slightly in the direction that reduces the error.
↻ Repeat for millions of examples
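
The four steps can be sketched in a few lines of plain Python. This is a toy illustration, not how a real framework is written: the "model" is a single weight `w`, the data follows y = 2x, and the learning rate of 0.05 is an assumed value chosen for this example.

```python
# Toy training loop: learn w so that prediction = w * x matches y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, ground truth) pairs
w = 0.0    # the model's single weight, starting from an arbitrary value
lr = 0.05  # learning rate (assumed value for this toy problem)

for epoch in range(200):
    for x, y in data:
        pred = w * x               # 1. forward pass
        loss = (pred - y) ** 2     # 2. calculate loss (squared error)
        grad = 2 * (pred - y) * x  # 3. backpropagation: d(loss)/d(w)
        w -= lr * grad             # 4. update weights

print(round(w, 3))  # ends up very close to 2.0
```

Nothing in the loop knows that the answer is 2; the weight drifts there because each update slightly reduces the error.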

A helpful analogy: think of how a child learns to throw a dart. They try (forward pass), they see how far from the bullseye they landed (loss), their brain figures out what went wrong — too much wrist, not enough force (backprop) — and they adjust for the next throw (update weights).

🎯 Try (forward pass): the model makes a prediction with current knowledge.
📏 Get Feedback (loss calculation): measure exactly how wrong the prediction was.
🔧 Adjust (gradient descent): update weights to be slightly less wrong next time.
💡 Key insight: the model doesn't understand anything. It's adjusting numbers (weights) based on math. After enough adjustments, those numbers happen to encode useful patterns. That's all “intelligence” is in these systems.

2. Datasets: The Fuel

The training loop can only run if you have data to feed it. Datasets are the fuel — and the quality of your dataset determines the ceiling of your model's performance. Garbage in, garbage out is the most important rule in ML.

Quality vs Quantity: 10,000 clean, well-labeled examples often beat 1,000,000 noisy ones. Models trained on bad data confidently learn the wrong things.

Labeled vs. Unlabeled Data

Labeled data has answers attached — "this image is a cat", "this review is positive". Training on labeled data is called supervised learning. It's powerful but expensive because humans must do the labeling.

Unlabeled data has no pre-attached answers. LLMs like GPT are trained on it using self-supervised learning: the "label" for each position is simply the next token in the text, so any large body of text (such as the web) supplies an enormous number of labels with no human labeling cost.
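
To make the "next token is the label" idea concrete, here is a minimal sketch that turns raw text into training pairs. The function name `next_token_pairs` and the whitespace split are assumptions for illustration; real LLM pipelines use a subword tokenizer and integer token IDs.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, next_token) training pairs from a token sequence."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the model's input
        pairs.append((context, tokens[i]))    # the "label" is the next token
    return pairs

text = "I love dogs and dogs love me"
tokens = text.split()  # whitespace split as a stand-in for a real tokenizer

for context, target in next_token_pairs(tokens, context_size=2):
    print(context, "->", target)
```

Seven words yield five labeled examples without anyone writing a single label by hand.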

Train / Validation / Test Splits

Split       Share  Purpose
Training    70%    The model learns from this data; weights are updated based on it.
Validation  15%    Used during training to tune hyperparameters and detect overfitting.
Test        15%    Touched once, at the very end, to measure real-world performance.
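
A split like the one in the table can be sketched as follows. The helper `split_dataset` is hypothetical, and the fixed seed is an assumption made so the split is reproducible; libraries such as scikit-learn offer their own utilities for this.

```python
import random

def split_dataset(rows, train=0.70, val=0.15, seed=0):
    """Shuffle rows, then cut them into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # shuffle so each split is representative
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    train_rows = rows[:n_train]
    val_rows = rows[n_train:n_train + n_val]
    test_rows = rows[n_train + n_val:]  # whatever remains is the test set
    return train_rows, val_rows, test_rows

train_rows, val_rows, test_rows = split_dataset(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```

Shuffling before cutting matters: if the file is sorted by label or by date, an unshuffled split can leave the test set looking nothing like the training set.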

What a Training Row Looks Like

Image Classification
Input (X): 28×28 pixel array (grayscale values 0–255)
Label (Y): Integer 0–9 (digit class)
Example: [[0,0,128,255…], …] → 7

Text Generation
Input (X): Sequence of token IDs: [1045, 2293, 9435]
Label (Y): Next token ID: [2937] (the word “dogs”)
Example: “I love” → “dogs”

3. Loss Functions

The loss function is the model's error signal — it measures the gap between what the model predicted and the correct answer. Lower loss = better model. Training is the process of minimizing this number.

Think of it like darts: the loss is the distance between where your dart landed and the bullseye. The training loop is asking: “how do I throw differently to land closer?”

Mean Squared Error (MSE), for regression
Loss = mean((predicted − actual)²)
Used when predicting a continuous number, like a house price or temperature. Squaring the error penalizes big mistakes more than small ones.
Example: predicting stock price → model guesses $150, actual is $120 → loss = (150−120)² = 900

Cross-Entropy Loss, for classification
Loss = −log(predicted probability of correct class)
Used when classifying into categories such as spam/not-spam or dog/cat. Measures how much probability the model assigned to the correct answer.
Example: model says 90% chance of cat (correct) → low loss. 10% chance → high loss.

Contrastive Loss, for similarity and embeddings
Loss = pull similar pairs together, push different pairs apart
Used for learning similarity, for example in embedding models and CLIP. Trains the model to place similar things close together in vector space.
Example: “dog photo” and “a photo of a dog” should have embeddings close together.
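
The three losses above can be written out directly. This is a sketch using only the standard library; the function names are illustrative, and the contrastive function uses a classic margin-based pairwise form, which is an assumption (the post gives no formula, and models like CLIP use a batch-wise variant instead).

```python
import math

def mse(predicted, actual):
    """Mean of squared differences over a list of predictions."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

def cross_entropy(prob_of_correct_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(prob_of_correct_class)

def contrastive_pair_loss(a, b, similar, margin=1.0):
    """Margin-based contrastive loss for one pair of embedding vectors."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if similar:
        return dist ** 2                     # pull similar pairs together
    return max(0.0, margin - dist) ** 2      # push different pairs apart

print(mse([150.0], [120.0]))         # 900.0, the stock price example
print(round(cross_entropy(0.9), 3))  # 0.105, confident and correct
print(round(cross_entropy(0.1), 3))  # 2.303, mostly wrong
print(contrastive_pair_loss([0.0, 0.0], [3.0, 4.0], similar=False))  # 0.0, already far apart
```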
💡 Why does the choice matter? The wrong loss function gives the model the wrong optimization target. If you use MSE for a classification task, the model might minimize the number but still classify things wrong. Loss function design is a key engineering decision.
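
One way to see part of the difference is to compare how each loss scores a confidently wrong classifier. This sketch assumes the target probability for the correct class is 1.0; the helper `compare` is hypothetical.

```python
import math

def compare(p_correct):
    """Return (squared error, cross-entropy) when the target probability is 1."""
    squared_error = (1.0 - p_correct) ** 2
    cross_entropy = -math.log(p_correct)
    return squared_error, cross_entropy

for p in [0.5, 0.1, 0.01, 0.0001]:
    se, ce = compare(p)
    print(f"p={p:<7} squared_error={se:.4f} cross_entropy={ce:.4f}")
```

The squared error never exceeds 1 no matter how wrong the model is, while cross-entropy keeps growing as the probability of the correct class approaches zero, so confident mistakes are punished much harder.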

4. Gradient Descent & Backpropagation

You have a loss. Now what? You need to update the model's weights to reduce it. That's what gradient descent does — it's the optimization algorithm that actually "trains" the model.

The Mountain Analogy

Imagine you're blindfolded on a hilly mountain, and your goal is to reach the valley (minimum loss). You can't see the whole landscape — but you can feel the slope under your feet. Gradient descent says: take a step in the downhill direction. Repeat until you're in a valley.

⛰️ You're on the mountain: high loss = high altitude. The goal is the valley (low loss).
📐 Feel the slope: the gradient points uphill from your current position, so its negative tells you which direction is downhill.
👣 Take a step: update weights by a small amount in the downhill direction. The learning rate controls step size.
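
Here is the mountain walk as code, on a deliberately simple landscape. The loss function (w − 3)² and the learning rate of 0.1 are assumptions chosen so the valley is known to be at w = 3.

```python
def loss(w):
    return (w - 3.0) ** 2       # a one-dimensional "mountain" with its valley at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)      # slope of the loss at w (positive means uphill to the right)

w = 0.0                         # starting position
learning_rate = 0.1             # step size

for step in range(50):
    w -= learning_rate * gradient(w)   # step against the gradient, i.e. downhill

print(round(w, 4), round(loss(w), 8))
```

Real models do the same thing, except `w` is a vector of millions or billions of weights and the gradient comes from backpropagation rather than a hand-written formula.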

The Learning Rate

💥 Too High: You overshoot the valley and bounce around, or diverge entirely. Loss gets worse, not better.
Just Right: Steady progress toward the minimum. Loss decreases smoothly over training epochs.
🐌 Too Low: Glacially slow progress. Training can take orders of magnitude longer than necessary, or stall before reaching a good minimum.
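
The three regimes are easy to reproduce on a toy problem. In this sketch the loss is w², and the three learning rates are assumed values picked to land in each regime for this particular function; the right numbers for a real model will differ.

```python
def final_loss(learning_rate, steps=20, start=1.0):
    """Run gradient descent on loss(w) = w**2 and return the final loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w   # the gradient of w**2 is 2*w
    return w ** 2

for name, lr in [("too high", 1.1), ("just right", 0.3), ("too low", 0.001)]:
    print(f"{name:<10} lr={lr:<6} final loss={final_loss(lr):.6g}")
```

With the highest rate each step overshoots and the loss grows; with the lowest, 20 steps barely move the weight from its starting loss of 1.0.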

Backpropagation

Before gradient descent can update the weights, it needs to know the gradient for each weight in the network. With billions of parameters, computing this naively would be impossibly slow. Backpropagation solves this with the chain rule of calculus — it propagates the error signal backwards through the network layer by layer, computing each weight's gradient efficiently.

You don't need to implement backprop yourself: PyTorch computes gradients automatically when you call loss.backward(), and TensorFlow does the same through tf.GradientTape. But understanding it conceptually helps you debug training problems.
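
To see what the frameworks automate, here is the chain rule applied by hand to a tiny network with two weights. The network shape (one ReLU unit followed by one output weight) and all function names are assumptions for this sketch; the numeric check confirms the hand-derived gradients.

```python
def forward(w1, w2, x, y):
    h = w1 * x               # layer 1
    a = max(0.0, h)          # ReLU activation
    pred = w2 * a            # layer 2
    loss = (pred - y) ** 2   # squared error
    return h, a, pred, loss

def backward(w1, w2, x, y):
    """Chain rule, applied from the loss back toward the input."""
    h, a, pred, _ = forward(w1, w2, x, y)
    d_pred = 2.0 * (pred - y)             # d(loss)/d(pred)
    d_w2 = d_pred * a                     # d(loss)/d(w2)
    d_a = d_pred * w2                     # error signal passed back to layer 1
    d_h = d_a * (1.0 if h > 0 else 0.0)   # ReLU passes gradient only where h > 0
    d_w1 = d_h * x                        # d(loss)/d(w1)
    return d_w1, d_w2

def numeric_grad(w1, w2, x, y, eps=1e-6):
    """Slow finite-difference estimate, used only to check the result."""
    loss_at = lambda a, b: forward(a, b, x, y)[3]
    d_w1 = (loss_at(w1 + eps, w2) - loss_at(w1 - eps, w2)) / (2 * eps)
    d_w2 = (loss_at(w1, w2 + eps) - loss_at(w1, w2 - eps)) / (2 * eps)
    return d_w1, d_w2

print(backward(0.5, -1.5, 2.0, 1.0))      # (15.0, -5.0)
print(numeric_grad(0.5, -1.5, 2.0, 1.0))  # approximately the same values
```

The finite-difference version needs two forward passes per weight, which is why it is hopeless at scale; backprop gets every gradient from one forward and one backward pass.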

SGD vs. Adam

SGD: classic, sensitive to lr

    # assumes `import torch` and an existing `model`
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.9,
    )

Adam: adaptive, default choice

    # Adapts lr per-parameter
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=1e-3,
    )
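
For intuition about what these optimizers do on each step, here is a single-weight sketch of both update rules in plain Python. It follows the standard published formulas with the same hyperparameters as above, but it is a simplified illustration rather than PyTorch's actual implementation; the toy loss (w − 3)² is an assumption.

```python
import math

def grad(w):
    return 2.0 * (w - 3.0)   # gradient of loss(w) = (w - 3)**2

def sgd_momentum(w, steps, lr=0.01, momentum=0.9):
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(w)  # running sum of past gradients
        w -= lr * velocity
    return w

def adam(w, steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return w

print(sgd_momentum(0.0, steps=500))
print(adam(0.0, steps=500))
```

Adam divides by the gradient's recent magnitude, so each step is roughly `lr` in size regardless of how steep the slope is. On this toy problem that means it is still well short of w = 3 after 500 steps at lr=1e-3, which reflects the step size chosen here and not a ranking of the two methods.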

5. Model Architectures

The training loop is the same for all models. What changes is the architecture — the structure of the model itself. Different architectures are suited to different types of data and tasks.

📊 Linear / Logistic Regression (simplest)
The baseline. A single layer mapping inputs directly to outputs. Useful for simple problems and as a starting point.
Typical uses: tabular data, baselines

🖼️ CNN (images)
Convolutional filters slide across the input detecting local patterns (edges, textures, shapes). Efficient because the same filter reuses its weights at every position.
Typical uses: image classification, object detection

RNN / LSTM (sequences)
Processes data step-by-step, maintaining a hidden state. LSTMs add gates to control memory. Largely superseded by Transformers.
Typical uses: time series, audio, legacy NLP

Transformer (state of the art)
Self-attention relates every token to every other token in parallel. The architecture behind nearly every modern LLM. Scales well with more data and compute.
Typical uses: text, code, multimodal; basically everything now
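
Two of the core operations named above fit in a few lines each. This is a bare-bones sketch: `conv1d` is a one-dimensional stand-in for an image convolution, and `self_attention` uses the input vectors directly as queries, keys, and values, whereas real Transformers first apply learned projection matrices (omitted here as a simplifying assumption).

```python
import math

def conv1d(signal, kernel):
    """Slide one kernel across the signal, reusing the same weights at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this token against every token, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Output is a weighted average of all token vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

print(conv1d([0, 0, 1, 1, 0, 0], [-1, 1]))   # [0, 1, 0, -1, 0]: fires at the edges
print(self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The convolution only ever looks at neighbouring values, while attention lets every position draw on every other position in one step.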

Architecture choice in practice: for most tasks today, start with a pre-trained Transformer (a BERT-style model for understanding, a GPT-style model for generation). Training CNNs and RNNs from scratch is now rare; you usually fine-tune an existing pre-trained model.

6. Putting It Together: Training Your First Model

Here's a complete training pipeline in annotated pseudocode. This structure works for everything from a simple classifier to fine-tuning a language model.

        
Python: complete training pipeline

    from torch.nn import CrossEntropyLoss
    from torch.optim import Adam
    from torch.utils.data import DataLoader

    # NOTE: load_and_split_dataset, TransformerClassifier, and evaluate are
    # placeholders for your own code; they are not PyTorch functions.

    # ── Step 1: Load and prepare your data ──
    train_data, val_data, test_data = load_and_split_dataset(
        path="data/",
        splits=[0.70, 0.15, 0.15],
    )
    train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=32)
    test_loader = DataLoader(test_data, batch_size=32)

    # ── Step 2: Define your model ──
    model = TransformerClassifier(
        vocab_size=50000, hidden_dim=256, num_layers=4, num_classes=2
    )

    # ── Step 3: Define loss and optimizer ──
    loss_fn = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=1e-4)

    # ── Step 4: The training loop ──
    num_epochs = 3  # example value; tune for your task
    for epoch in range(num_epochs):
        model.train()
        for batch_inputs, batch_labels in train_loader:
            predictions = model(batch_inputs)          # forward pass
            loss = loss_fn(predictions, batch_labels)  # compute loss
            optimizer.zero_grad()                      # clear gradients
            loss.backward()                            # backprop
            optimizer.step()                           # update weights

        # ── Step 5: Validate (once per epoch) ──
        val_loss, val_acc = evaluate(model, val_loader)
        print(f"Epoch {epoch}: val_loss={val_loss:.4f}, acc={val_acc:.2%}")

    # ── Step 6: Final test evaluation ──
    test_loss, test_acc = evaluate(model, test_loader)
    print(f"Test accuracy: {test_acc:.2%}")

What changes for LLMs? Scale. The structure above applies to fine-tuning GPT or Llama — you're still running this loop. The difference is the model has billions of parameters, batches are token sequences, and loss is cross-entropy over the vocabulary at each position.
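
"Cross-entropy over the vocabulary at each position" can be sketched as follows. The four-token vocabulary, the logits, and the function names are all made up for illustration; a real LLM produces one row of logits per position with tens of thousands of entries.

```python
import math

def softmax(logits):
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(logits_per_position, target_ids):
    """Average next-token cross-entropy over all positions in one sequence."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)               # distribution over the vocabulary
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical model outputs: 3 positions, 4-token vocabulary.
logits = [
    [2.0, 0.5, 0.1, 0.1],   # fairly confident the next token is ID 0
    [0.1, 0.2, 3.0, 0.1],   # fairly confident the next token is ID 2
    [0.0, 0.0, 0.0, 0.0],   # no idea: uniform over the vocabulary
]
targets = [0, 2, 3]         # the tokens that actually came next

print(round(sequence_cross_entropy(logits, targets), 4))
```

A model that guesses uniformly scores log(vocabulary size) at that position, which is a useful sanity check for the loss value at the very start of training.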

7. Knowledge Check

Five questions to test your understanding.

Question 1 of 5
What does "loss" measure in a machine learning model?
Question 2 of 5
In gradient descent, what does the "learning rate" control?
Question 3 of 5
What is backpropagation?
Question 4 of 5
Which architecture is best suited for processing image data?
Question 5 of 5
What is the purpose of the validation set during training?

8. What's Next

You now understand the core mechanics of how any AI model is trained. The next posts build on this foundation — going into generative models, LLMs, and building real applications.

Post 02: Generative AI Fundamentals (Coming Soon)
VAEs, GANs, Diffusion Models: how machines learn to generate images, text and audio.

Post 03: Working with LLMs (Coming Soon)
Fine-tuning, prompt engineering, RAG, and embedding search for production LLM apps.

Post 04: Building a Chatbot (Coming Soon)
End-to-end walkthrough: from API calls to memory, tools, and deployment.

Post 05: Image Generation Pipeline (Coming Soon)
Stable Diffusion internals, ControlNet, LoRA fine-tuning for custom image models.

Post 06: Deploying AI to Production (Coming Soon)
Serving models, latency optimization, monitoring, cost control, and scaling.