06 / 06·2026-06-30·20 min read·MLOpsServingScalingAdvanced
Getting a model to work on your laptop is the easy part. This final post covers what it takes
to run AI reliably in production: model serving, GPU optimization,
latency tuning, cost control, monitoring,
and scaling — everything between your notebook and real users.
Your first decision: where and how does the model run? There are three main approaches:
Managed API
Claude, GPT-4, Gemini
Zero infra overhead
Pay per token
Data leaves your infra
Best for: most products
Self-hosted
vLLM, TGI, Triton
Full data control
Fixed GPU cost
High ops overhead
Best for: regulated industries
Serverless GPU
Modal, RunPod, Replicate
Pay per second of GPU
Cold start latency
Easy to deploy
Best for: bursty workloads
💡
Default to managed APIs. Unless you have a specific reason to self-host (data privacy, cost at extreme scale, unique model), a managed API will ship faster, be more reliable, and cost less in engineer time.
2. Model Optimization
If you self-host, these techniques dramatically reduce memory and speed up inference:
Quantization
4–8x memory reduction
Reduce weight precision from FP32 to INT8 or INT4. Most modern models support 4-bit quantization with minimal quality loss. Use bitsandbytes or GPTQ.
Flash Attention 2
~2x speedup over FA1
Memory-efficient attention implementation. Drop-in replacement; works with transformers automatically when installed.
Continuous batching
2–4x throughput vs SOTA
Process multiple requests simultaneously using the same GPU pass. vLLM does this automatically with PagedAttention.
Speculative decoding
2–3x speedup
Use a small draft model to predict tokens, verify with the large model in batch. Reduces the number of forward passes needed.
Users notice latency above 200ms for first-token and above 1s for full response.
Here's how to hit those targets:
Strategy
Where it helps
Impact
Streaming
Perceived first-token
High
Smaller model
All latency
High
GPU region proximity
Network overhead
Medium
Prompt caching
Long system prompts
High
KV cache warming
Repeated context
Medium
Async processing
Non-blocking requests
Medium
⚡
Always stream. Streaming makes a 5-second response feel fast because users see output immediately. This is the single highest-leverage latency improvement with zero model changes.
4. Cost Control
AI inference can get expensive fast. These levers control the bill:
📏
Limit token usage
Set max_tokens aggressively. Most responses don't need 4096 tokens — cap at what your use case actually needs.
🗜️
Compress context
Use a sliding window or summarize old messages. Every input token costs money too, not just output.
⚡
Use cheaper models
Route simple tasks to Haiku/Flash, complex ones to Sonnet/Opus. A classifier + routing layer can cut costs 10x.
💾
Cache responses
For deterministic or repeat queries, cache the response. Identical prompts at temperature=0 always produce the same output.
model_router.py
defroute_model(user_message: str) -> str:
# Simple heuristic routing — use small model for short tasks
token_estimate = len(user_message.split()) * 1.3
if token_estimate < 200 and"summarize"not in user_message:
return"claude-haiku-4-5"# cheap & fastelif token_estimate < 1000:
return"claude-sonnet-4-6"# balancedelse:
return"claude-opus-4-5"# complex tasks
5. Monitoring & Observability
You can't improve what you don't measure. Track these metrics from day one:
Performance
Time to first token (TTFT) — target <500ms
Tokens per second — measure throughput
p95/p99 latency — worst-case user experience
Cost
Input / output tokens per request
Cost per user per day
Cache hit rate
Quality
User thumbs up/down rate
Refusal / error rate
LLM-as-judge evals — automated quality scoring
⚠️
Log everything. Store full request/response pairs (within privacy constraints). You'll need them to debug regressions, understand costs, and build eval datasets.
6. Scaling Patterns
When your traffic grows, here's how to scale without rewriting everything:
Stage 1
Single instance — one API server, async endpoints, streaming. Handles 10–100 concurrent users comfortably.
Stage 2
Job queue — Celery + Redis or BullMQ. Decouple request acceptance from model inference. Handles bursts gracefully.
Stage 3
Horizontal scaling — multiple inference workers behind a load balancer. Add workers as traffic grows.
Stage 4
Multi-region — deploy inference close to users. Use a CDN for static assets, regional APIs for inference.
7. Knowledge Check
5 questions · pick the best answer
Q1 of 5
When should you default to a managed API (Claude, GPT) instead of self-hosting?
Q2 of 5
What does quantization do to a model?
Q3 of 5
Why is streaming so effective for perceived latency?
Q4 of 5
What is the purpose of a job queue in an AI serving stack?
Q5 of 5
Which of these is NOT a recommended cost control strategy?
8. Series Complete
🎉
You've completed Building AI Model from Scratch. You now understand how models are trained, how generative AI works, how to work with LLMs, build chatbots, generate images, and deploy everything to production.
Here's what you covered across the 6 posts:
01Training loops, loss functions, gradient descent, architectures
02VAEs, GANs, Diffusion Models — the generative AI families