Building AI Model from Scratch · Series · 6 posts

Building a Chatbot

This is the post where theory becomes a product. We'll build a fully functional AI chatbot step by step: start with a bare API call, add conversation memory, add tool use (so it can take real actions), and finally deploy to production with streaming and error handling.

1. Your First API Call

Every chatbot starts with a single API call. Here's the minimal working version with Claude:

first_call.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is gradient descent?"}
    ],
)
print(response.content[0].text)
💡 The Messages API is stateless. Each call is independent: the model has no memory of previous calls unless you explicitly include the prior messages in the messages array.

2. Conversation Memory

To build a chatbot that remembers what was said, you must accumulate the message history and pass it with every request. This is the messages pattern.

chatbot_memory.py
import anthropic

client = anthropic.Anthropic()
history = []  # grows with each turn

def chat(user_input):
    history.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a concise AI tutor.",
        messages=history,  # full history every time
    )
    assistant_reply = response.content[0].text
    history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply

# Usage
print(chat("What is a neural network?"))
print(chat("Can you give me an example?"))  # remembers the context

Managing context window limits

Context windows are finite. As conversations grow, you'll hit limits or pay more. Common strategies:

Sliding window — keep only the last N turns
Summarize & compress — periodically summarize old turns into a shorter summary message
Semantic retrieval — store all turns, retrieve only the relevant ones per query (like RAG for chat history)
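
The first two strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the names sliding_window and compress are made up for this example, and summarize is a hypothetical callable you would supply yourself (for instance, a function that asks the model for a summary). Semantic retrieval needs an embedding store and is not shown.

```python
from typing import Callable

Message = dict  # e.g. {"role": "user", "content": "..."}


def sliding_window(history: list[Message], max_messages: int) -> list[Message]:
    """Return at most the last `max_messages` messages, opening on a user turn."""
    window = history[-max_messages:] if max_messages > 0 else []
    # A window cut mid-exchange can open on an assistant reply; skip it so the
    # request still begins with a user message.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window


def compress(
    history: list[Message],
    keep_recent: int,
    summarize: Callable[[list[Message]], str],
) -> tuple[str, list[Message]]:
    """Split history into (summary of old messages, recent messages kept verbatim)."""
    cut = max(len(history) - keep_recent, 0)
    old, recent = history[:cut], history[cut:]
    # `summarize` is supplied by the caller (hypothetical helper, see lead-in).
    summary = summarize(old) if old else ""
    return summary, recent
```

One way to use the result of compress is to append the summary text to the system prompt and send only the recent messages, passing them through sliding_window first so the request opens on a user turn.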

3. Tool Use (Function Calling)

Tool use lets the model call your functions — search the web, query a database, send an email. You define the tools; the model decides when to use them.

tool_use.py
import anthropic

client = anthropic.Anthropic()

def fetch_weather_api(city):
    # Placeholder stub with dummy values; replace with a call to a real weather service
    return {"city": city, "temp_c": 21, "conditions": "clear"}

# 1. Define tools
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"],
    },
}]

# 2. Send to model
question = {"role": "user", "content": "What's the weather in Tokyo?"}
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[question],
)

# 3. Check if model wants to call a tool
if response.stop_reason == "tool_use":
    # Find the tool_use block by type; its position in content is not guaranteed
    tool_call = next(block for block in response.content if block.type == "tool_use")
    city = tool_call.input["city"]

    # 4. Execute the real function
    weather_data = fetch_weather_api(city)

    # 5. Send result back
    final = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=[
            question,
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": str(weather_data),
            }]},
        ],
    )
    print(final.content[0].text)

Tool safety: always validate and sanitize tool inputs before executing them. The model can be prompted into calling tools with malicious inputs — treat tool calls like untrusted user input.
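
As a rough sketch of what that validation could look like for the get_weather tool, the check below runs before any real function is called. The allow-list, the validate_tool_call name, and the deliberately strict ASCII-only city pattern are all assumptions made for this example; a real service would likely need to accept non-ASCII city names.

```python
import re

# Assumption for this sketch: only tools named here may ever be executed.
ALLOWED_TOOLS = {"get_weather"}
# Deliberately strict: a letter, then up to 63 letters, spaces, dots,
# apostrophes, or hyphens.
CITY_PATTERN = re.compile(r"[A-Za-z][A-Za-z .'-]{0,63}")


def validate_tool_call(name: str, tool_input: object) -> dict:
    """Return a cleaned copy of the tool input, or raise ValueError."""
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(tool_input, dict) or set(tool_input) != {"city"}:
        raise ValueError("get_weather expects exactly one argument: city")
    city = tool_input["city"]
    if not isinstance(city, str) or not CITY_PATTERN.fullmatch(city.strip()):
        raise ValueError("city must be a short name of letters, spaces, . ' or -")
    return {"city": city.strip()}
```

Calling validate_tool_call(tool_call.name, tool_call.input) and passing only its return value to the weather function keeps unexpected keys and shell- or SQL-like payloads away from downstream code.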

4. Streaming Responses

Streaming sends tokens to the client as they're generated. This dramatically improves perceived latency — users see output start immediately rather than waiting for the full response.

streaming.py
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in detail"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

5. Production Checklist

Rate limiting

Throttle per-user to prevent runaway API costs. Store usage in Redis or a DB.
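
A per-user throttle can be as simple as a fixed-window counter. The sketch below keeps counts in a process-local dict purely for illustration; the class name and parameters are made up for this example, and a multi-server deployment would keep the same counters in Redis or a database instead.

```python
import time


class FixedWindowLimiter:
    """Allow at most `max_requests` per user in each `window_seconds` window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._windows = {}  # user_id -> (window_start, request_count)

    def allow(self, user_id):
        now = self.clock()
        start, count = self._windows.get(user_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the previous window expired; open a new one
        if count >= self.max_requests:
            self._windows[user_id] = (start, count)
            return False
        self._windows[user_id] = (start, count + 1)
        return True
```

A request handler would call limiter.allow(user_id) before touching the model API and return HTTP 429 to the caller when it comes back False.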

Error handling & retries

Wrap API calls with exponential backoff. Handle 429 rate limit errors gracefully.
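
Here is one way to sketch exponential backoff with jitter as a generic wrapper. The with_backoff name and its default values are assumptions for this example, not part of any SDK; the caller decides which exception types count as retryable.

```python
import random
import time


def with_backoff(call, retryable=(Exception,), max_attempts=5,
                 base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run `call()`, retrying on `retryable` errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

With the client from earlier sections this might be used as with_backoff(lambda: client.messages.create(...), retryable=(anthropic.RateLimitError,)), so that only 429 responses are retried and other errors fail fast.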

Input validation

Limit message length, strip control characters, reject obviously malicious inputs.
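
A minimal sketch of the first two checks follows. The clean_user_message name is made up for this example, and the 2000-character cap is simply the limit used in the full example later in this post. Detecting "obviously malicious" content is application-specific and is not attempted here.

```python
import unicodedata

MAX_MESSAGE_CHARS = 2000  # same cap the full example in this post applies


def clean_user_message(raw, max_chars=MAX_MESSAGE_CHARS):
    """Strip control characters and enforce a length limit, or raise ValueError."""
    if not isinstance(raw, str):
        raise ValueError("message must be a string")
    # Remove control characters (Unicode category Cc) but keep newlines and tabs.
    cleaned = "".join(
        ch for ch in raw
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ).strip()
    if not cleaned:
        raise ValueError("message is empty")
    if len(cleaned) > max_chars:
        raise ValueError(f"message exceeds {max_chars} characters")
    return cleaned
```

This version rejects over-long input rather than silently truncating it, so the user learns that part of the message would have been dropped; truncation is a reasonable alternative if you prefer not to fail the request.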

Logging & observability

Log every request with token counts and latency. You need this for debugging and cost attribution.
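
One possible shape for such a log line is sketched below. The logged_call wrapper and the field names are assumptions for this example; it reads the usage.input_tokens and usage.output_tokens attributes that the Messages API response carries, and it leaves handler and level configuration to the application.

```python
import json
import logging
import time

logger = logging.getLogger("chatbot.requests")


def logged_call(call, *, user_id, clock=time.perf_counter):
    """Run `call()`, then log token counts and latency as one JSON line."""
    start = clock()
    response = call()
    latency_ms = (clock() - start) * 1000
    record = {
        "user_id": user_id,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": round(latency_ms, 1),
    }
    logger.info(json.dumps(record))  # one JSON object per request is easy to aggregate
    return response, record
```

Keeping user_id in every record is what makes per-user cost attribution possible later; the same record could also be written to a metrics system instead of a log file.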

Persist conversation history

Store in a database (Postgres, DynamoDB). In-memory history dies on server restart.
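
To keep the example self-contained, the sketch below uses SQLite from the standard library as a stand-in for Postgres or DynamoDB. The HistoryStore class and its table layout are assumptions for illustration. Note that the default ":memory:" path is only useful for trying it out; pass a file path to get history that survives a restart.

```python
import sqlite3


class HistoryStore:
    """Append-only message log keyed by session_id."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS messages ("
                " id INTEGER PRIMARY KEY AUTOINCREMENT,"
                " session_id TEXT NOT NULL,"
                " role TEXT NOT NULL,"
                " content TEXT NOT NULL)"
            )

    def append(self, session_id, role, content):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (session_id, role, content),
            )

    def load(self, session_id, limit=None):
        """Return the session's messages oldest-first; `limit` keeps the newest N."""
        rows = self.conn.execute(
            "SELECT role, content FROM ("
            " SELECT id, role, content FROM messages"
            " WHERE session_id = ? ORDER BY id DESC LIMIT ?"
            ") ORDER BY id ASC",
            (session_id, -1 if limit is None else limit),  # SQLite: -1 means no limit
        ).fetchall()
        return [{"role": role, "content": content} for role, content in rows]
```

The list returned by load has the same shape as the messages array, so it can be passed to the API directly after any context trimming.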

6. Full Chatbot Example

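A FastAPI chatbot that combines streaming, memory, and error handling. Sessions are kept in an in-process dict to keep the example short; per the checklist above, move them to a database before treating this as production-ready: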

chatbot_api.py
import json

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.AsyncAnthropic()  # async client, so streaming does not block the event loop
sessions = {}  # session_id -> message history

@app.post("/chat/{session_id}")
async def chat(session_id: str, body: dict):
    user_msg = body["message"][:2000]  # cap input length
    history = sessions.setdefault(session_id, [])
    history.append({"role": "user", "content": user_msg})

    window = history[-20:]  # sliding window
    if window[0]["role"] != "user":
        window = window[1:]  # the request must open with a user message

    async def generate():
        full_text = ""
        try:
            async with client.messages.stream(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a helpful AI assistant.",
                messages=window,
            ) as stream:
                async for text in stream.text_stream:
                    full_text += text
                    yield json.dumps({"delta": text}) + "\n"
        except anthropic.RateLimitError:
            history.pop()  # drop the unanswered user turn so a retry is not duplicated
            yield json.dumps({"error": "Rate limit hit, please wait."}) + "\n"
            return
        history.append({"role": "assistant", "content": full_text})

    return StreamingResponse(generate(), media_type="application/x-ndjson")

7. Knowledge Check

5 questions · pick the best answer

Q1 of 5: Why does a basic LLM API have no memory between calls?
Q2 of 5: What does the model return when it wants to call a tool?
Q3 of 5: What is the main benefit of streaming API responses?
Q4 of 5: Why is it important to treat tool call inputs as untrusted?
Q5 of 5: A sliding window context strategy means...

8. What's Next

Your chatbot is live. Next: build an image generation pipeline with Stable Diffusion.
