Topics Building AI Model from Scratch
Series · 6 posts Contact

Deploying AI
to Production

Getting a model to work on your laptop is the easy part. This final post covers what it takes to run AI reliably in production: model serving, GPU optimization, latency tuning, cost control, monitoring, and scaling — everything between your notebook and real users.

1. Serving Options

Your first decision: where and how does the model run? There are three main approaches:

Managed API
Claude, GPT-4, Gemini
  • Zero infra overhead
  • Pay per token
  • Data leaves your infra
  • Best for: most products
Self-hosted
vLLM, TGI, Triton
  • Full data control
  • Fixed GPU cost
  • High ops overhead
  • Best for: regulated industries
Serverless GPU
Modal, RunPod, Replicate
  • Pay per second of GPU
  • Cold start latency
  • Easy to deploy
  • Best for: bursty workloads
💡

Default to managed APIs. Unless you have a specific reason to self-host (data privacy, cost at extreme scale, unique model), a managed API will ship faster, be more reliable, and cost less in engineer time.

2. Model Optimization

If you self-host, these techniques dramatically reduce memory and speed up inference:

Quantization
4–8x memory reduction
Reduce weight precision from FP32 to INT8 or INT4. Most modern models support 4-bit quantization with minimal quality loss. Use bitsandbytes or GPTQ.
Flash Attention 2
~2x speedup over FA1
Memory-efficient attention implementation. Drop-in replacement; works with transformers automatically when installed.
Continuous batching
2–4x throughput vs SOTA
Process multiple requests simultaneously using the same GPU pass. vLLM does this automatically with PagedAttention.
Speculative decoding
2–3x speedup
Use a small draft model to predict tokens, verify with the large model in batch. Reduces the number of forward passes needed.
vllm_serve.py
# vLLM: state-of-the-art LLM serving with continuous batching from vllm import LLM, SamplingParams llm = LLM( model="meta-llama/Llama-3-8B-Instruct", quantization="awq", # 4-bit quantization max_model_len=4096, gpu_memory_utilization=0.9, ) params = SamplingParams(temperature=0.7, max_tokens=512) outputs = llm.generate(["Explain RAG in simple terms"], params) print(outputs[0].outputs[0].text)

3. Latency Strategies

Users notice latency above 200ms for first-token and above 1s for full response. Here's how to hit those targets:

Strategy
Where it helps
Impact
Streaming
Perceived first-token
High
Smaller model
All latency
High
GPU region proximity
Network overhead
Medium
Prompt caching
Long system prompts
High
KV cache warming
Repeated context
Medium
Async processing
Non-blocking requests
Medium

Always stream. Streaming makes a 5-second response feel fast because users see output immediately. This is the single highest-leverage latency improvement with zero model changes.

4. Cost Control

AI inference can get expensive fast. These levers control the bill:

📏
Limit token usage
Set max_tokens aggressively. Most responses don't need 4096 tokens — cap at what your use case actually needs.
🗜️
Compress context
Use a sliding window or summarize old messages. Every input token costs money too, not just output.
Use cheaper models
Route simple tasks to Haiku/Flash, complex ones to Sonnet/Opus. A classifier + routing layer can cut costs 10x.
💾
Cache responses
For deterministic or repeat queries, cache the response. Identical prompts at temperature=0 always produce the same output.
model_router.py
def route_model(user_message: str) -> str: # Simple heuristic routing — use small model for short tasks token_estimate = len(user_message.split()) * 1.3 if token_estimate < 200 and "summarize" not in user_message: return "claude-haiku-4-5" # cheap & fast elif token_estimate < 1000: return "claude-sonnet-4-6" # balanced else: return "claude-opus-4-5" # complex tasks

5. Monitoring & Observability

You can't improve what you don't measure. Track these metrics from day one:

Performance
Time to first token (TTFT) — target <500ms
Tokens per second — measure throughput
p95/p99 latency — worst-case user experience
Cost
Input / output tokens per request
Cost per user per day
Cache hit rate
Quality
User thumbs up/down rate
Refusal / error rate
LLM-as-judge evals — automated quality scoring
⚠️

Log everything. Store full request/response pairs (within privacy constraints). You'll need them to debug regressions, understand costs, and build eval datasets.

6. Scaling Patterns

When your traffic grows, here's how to scale without rewriting everything:

Stage 1
Single instance — one API server, async endpoints, streaming. Handles 10–100 concurrent users comfortably.
Stage 2
Job queue — Celery + Redis or BullMQ. Decouple request acceptance from model inference. Handles bursts gracefully.
Stage 3
Horizontal scaling — multiple inference workers behind a load balancer. Add workers as traffic grows.
Stage 4
Multi-region — deploy inference close to users. Use a CDN for static assets, regional APIs for inference.

7. Knowledge Check

5 questions · pick the best answer

Q1 of 5
When should you default to a managed API (Claude, GPT) instead of self-hosting?
Q2 of 5
What does quantization do to a model?
Q3 of 5
Why is streaming so effective for perceived latency?
Q4 of 5
What is the purpose of a job queue in an AI serving stack?
Q5 of 5
Which of these is NOT a recommended cost control strategy?

8. Series Complete

🎉

You've completed Building AI Model from Scratch. You now understand how models are trained, how generative AI works, how to work with LLMs, build chatbots, generate images, and deploy everything to production.

Here's what you covered across the 6 posts:

01Training loops, loss functions, gradient descent, architectures
02VAEs, GANs, Diffusion Models — the generative AI families
03Prompt engineering, fine-tuning, RAG, embeddings
04Chatbot with memory, tool use, streaming, production API
05Stable Diffusion, ControlNet, LoRA, image generation API
06Serving, optimization, latency, cost, monitoring, scaling
← Previous Image Generation Pipeline
↑ Series Building AI Model from Scratch