Deploying AI
to Production

06 / 06 · 2026-06-30 · 20 min read · MLOps Serving Scaling Advanced

Getting a model to work on your laptop is the easy part. This final post covers what it takes to run AI reliably in production: model serving, GPU optimization, latency tuning, cost control, monitoring, and scaling — everything between your notebook and real users.

In this post

1. Serving Options 2. Model Optimization 3. Latency Strategies 4. Cost Control 5. Monitoring & Observability 6. Scaling Patterns 7. Knowledge Check 8. Series Complete

1. Serving Options

Your first decision: where and how does the model run? There are three main approaches:

Managed API

Claude, GPT-4, Gemini

Zero infra overhead
Pay per token
Data leaves your infra
Best for: most products

Self-hosted

vLLM, TGI, Triton

Full data control
Fixed GPU cost
High ops overhead
Best for: regulated industries

Serverless GPU

Modal, RunPod, Replicate

Pay per second of GPU
Cold start latency
Easy to deploy
Best for: bursty workloads

💡

Default to managed APIs. Unless you have a specific reason to self-host (data privacy, cost at extreme scale, unique model), a managed API will ship faster, be more reliable, and cost less in engineer time.

2. Model Optimization

If you self-host, these techniques dramatically reduce memory and speed up inference:

Quantization

4–8x memory reduction

Reduce weight precision from FP32 to INT8 or INT4. Most modern models support 4-bit quantization with minimal quality loss. Use bitsandbytes or GPTQ.

Flash Attention 2

~2x speedup over FA1

Memory-efficient attention implementation. Drop-in replacement; works with transformers automatically when installed.

Continuous batching

2–4x throughput vs SOTA

Process multiple requests simultaneously using the same GPU pass. vLLM does this automatically with PagedAttention.

Speculative decoding

2–3x speedup

Use a small draft model to predict tokens, verify with the large model in batch. Reduces the number of forward passes needed.

vllm_serve.py
# vLLM: state-of-the-art LLM serving with continuous batching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    quantization="awq",     # 4-bit quantization
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain RAG in simple terms"], params)
print(outputs[0].outputs[0].text)

3. Latency Strategies

Users notice latency above 200ms for first-token and above 1s for full response. Here's how to hit those targets:

Streaming

Perceived first-token

High

Smaller model

All latency

High

GPU region proximity

Network overhead

Medium

Prompt caching

Long system prompts

High

KV cache warming

Repeated context

Medium

Async processing

Non-blocking requests

Medium

⚡

Always stream. Streaming makes a 5-second response feel fast because users see output immediately. This is the single highest-leverage latency improvement with zero model changes.

4. Cost Control

AI inference can get expensive fast. These levers control the bill:

📏

Limit token usage

Set max_tokens aggressively. Most responses don't need 4096 tokens — cap at what your use case actually needs.

🗜️

Compress context

Use a sliding window or summarize old messages. Every input token costs money too, not just output.

⚡

Use cheaper models

Route simple tasks to Haiku/Flash, complex ones to Sonnet/Opus. A classifier + routing layer can cut costs 10x.

💾

Cache responses

For deterministic or repeat queries, cache the response. Identical prompts at temperature=0 always produce the same output.

model_router.py
def route_model(user_message: str) -> str:
    # Simple heuristic routing — use small model for short tasks
    token_estimate = len(user_message.split()) * 1.3

    if token_estimate < 200 and "summarize" not in user_message:
        return "claude-haiku-4-5"   # cheap & fast
    elif token_estimate < 1000:
        return "claude-sonnet-4-6"           # balanced
    else:
        return "claude-opus-4-5"             # complex tasks

5. Monitoring & Observability

You can't improve what you don't measure. Track these metrics from day one:

Performance

Time to first token (TTFT) — target <500ms

Tokens per second — measure throughput

p95/p99 latency — worst-case user experience

Cost

Input / output tokens per request

Cost per user per day

Cache hit rate

Quality

User thumbs up/down rate

Refusal / error rate

LLM-as-judge evals — automated quality scoring

⚠️

Log everything. Store full request/response pairs (within privacy constraints). You'll need them to debug regressions, understand costs, and build eval datasets.

6. Scaling Patterns

When your traffic grows, here's how to scale without rewriting everything:

Stage 1

Single instance — one API server, async endpoints, streaming. Handles 10–100 concurrent users comfortably.

Stage 2

Job queue — Celery + Redis or BullMQ. Decouple request acceptance from model inference. Handles bursts gracefully.

Stage 3

Horizontal scaling — multiple inference workers behind a load balancer. Add workers as traffic grows.

Stage 4

Multi-region — deploy inference close to users. Use a CDN for static assets, regional APIs for inference.

7. Knowledge Check

5 questions · pick the best answer

Q1 of 5

When should you default to a managed API (Claude, GPT) instead of self-hosting?

Q2 of 5

What does quantization do to a model?

Q3 of 5

Why is streaming so effective for perceived latency?

Q4 of 5

What is the purpose of a job queue in an AI serving stack?

Q5 of 5

Which of these is NOT a recommended cost control strategy?

8. Series Complete

🎉

You've completed Building AI Model from Scratch. You now understand how models are trained, how generative AI works, how to work with LLMs, build chatbots, generate images, and deploy everything to production.

Here's what you covered across the 6 posts:

01Training loops, loss functions, gradient descent, architectures

02VAEs, GANs, Diffusion Models — the generative AI families

03Prompt engineering, fine-tuning, RAG, embeddings

04Chatbot with memory, tool use, streaming, production API

05Stable Diffusion, ControlNet, LoRA, image generation API

06Serving, optimization, latency, cost, monitoring, scaling

Deploying AIto Production

1. Serving Options

2. Model Optimization

3. Latency Strategies

4. Cost Control

5. Monitoring & Observability

6. Scaling Patterns

7. Knowledge Check

8. Series Complete

Deploying AI
to Production