Building a Chatbot

04 / 06 · 2026-06-30 · 22 min read · Chatbot Memory Tool Use Intermediate

This is the post where theory becomes a product. We'll build a working AI chatbot step by step: start with a bare API call, add conversation memory, add tool use (so it can take real actions), and finish with streaming and error handling for production.

In this post

1. Your First API Call
2. Conversation Memory
3. Tool Use (Function Calling)
4. Streaming Responses
5. Production Checklist
6. Full Chatbot Example
7. Knowledge Check
8. What's Next

1. Your First API Call

Every chatbot starts with a single API call. Here's the minimal working version with Claude:

first_call.py
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is gradient descent?"}
    ]
)

print(response.content[0].text)

💡

The API is stateless. Each call is independent: the model has no memory of previous calls unless you explicitly include the prior messages in the messages array.

2. Conversation Memory

To build a chatbot that remembers what was said, you must accumulate the message history and pass it with every request. This is the messages pattern.

chatbot_memory.py
import anthropic

client = anthropic.Anthropic()
history = []   # grows with each turn

def chat(user_input):
    history.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a concise AI tutor.",
        messages=history   # full history every time
    )

    assistant_reply = response.content[0].text
    history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply

# Usage
print(chat("What is a neural network?"))
print(chat("Can you give me an example?"))  # remembers the context

Managing context window limits

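Context windows are finite. Because the full history is resent on every call, input token cost grows with each turn, and a long enough conversation will eventually exceed the model's context limit. Common strategies: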

Sliding window — keep only the last N turns

Summarize & compress — periodically summarize old turns into a shorter summary message

Semantic retrieval — store all turns, retrieve only the relevant ones per query (like RAG for chat history)
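Of the three strategies, the sliding window is the easiest to sketch. The helper below is illustrative rather than part of any SDK: sliding_window and max_turns are names invented for this example, and it assumes history is the list of role/content dicts built in chatbot_memory.py. It also assumes the API wants the conversation to open with a user message, so it trims any assistant message left at the front after the cut.

```python
def sliding_window(history, max_turns=10):
    """Return the most recent messages, at most max_turns user/assistant pairs.

    Assumes history is a list of {"role": ..., "content": ...} dicts.
    """
    if max_turns <= 0:
        return []
    window = history[-2 * max_turns:]
    # Drop leading assistant messages so the window opens on a user turn.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window
```

To use it, pass messages=sliding_window(history) instead of messages=history. The full history list stays intact, so you can still summarize or index the older turns later.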

3. Tool Use (Function Calling)

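Tool use lets the model call your functions: search the web, query a database, send an email. You define the tools; the model decides when to use them. The model never executes anything itself. It returns a structured request naming the tool and its arguments, your code runs the function, and you send the result back in a follow-up message.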

tool_use.py
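import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from env

def fetch_weather_api(city):
    # Placeholder so the example runs end to end; swap in a real weather lookup
    return {"city": city, "temp_c": 18, "conditions": "clear"}

# 1. Define tools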
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
}]

# 2. Send to model
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

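# 3. Check if model wants to call a tool
if response.stop_reason == "tool_use":
    # The tool_use block is not at a fixed index (a text block may or may not
    # precede it), so search the content list instead of hard-coding [1]
    tool_call = next(b for b in response.content if b.type == "tool_use")
    city = tool_call.input["city"]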

    # 4. Execute the real function
    weather_data = fetch_weather_api(city)

    # 5. Send result back
    final = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user",      "content": "What's the weather in Tokyo?"},
            {"role": "assistant", "content": response.content},
            {"role": "user",      "content": [{
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": str(weather_data)
            }]}
        ]
    )
    print(final.content[0].text)
else:
    print(response.content[0].text)   # model answered without a tool

⚡

Tool safety: always validate and sanitize tool inputs before executing them. The model can be prompted into calling tools with malicious inputs — treat tool calls like untrusted user input.
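As a rough illustration of that advice, here is one way the city argument for get_weather could be checked before it reaches a real API. The function name validate_city, the 80-character cap, and the allowed punctuation set are all assumptions made for this sketch; tune them to whatever your downstream service actually accepts.

```python
ALLOWED_PUNCTUATION = set(" .'-")

def validate_city(tool_input, max_len=80):
    """Return a cleaned city name, or raise ValueError if the input looks wrong."""
    if not isinstance(tool_input, dict):
        raise ValueError("tool input must be an object")
    city = tool_input.get("city")
    if not isinstance(city, str):
        raise ValueError("city must be a string")
    city = city.strip()
    if not 1 <= len(city) <= max_len:
        raise ValueError("city has an unexpected length")
    # Allow letters in any script plus a few punctuation marks; reject the rest.
    if not all(ch.isalpha() or ch in ALLOWED_PUNCTUATION for ch in city):
        raise ValueError("city contains unexpected characters")
    return city
```

In tool_use.py you would call validate_city(tool_call.input) in place of reading tool_call.input["city"] directly. If it raises, one option is to send the error text back as the tool_result content with "is_error": True, so the model can correct itself or ask the user.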

4. Streaming Responses

Streaming sends tokens to the client as they're generated. This dramatically improves perceived latency — users see output start immediately rather than waiting for the full response.

streaming.py
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from env

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in detail"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

5. Production Checklist

✓

Rate limiting

Throttle per-user to prevent runaway API costs. Store usage in Redis or a DB.
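To make the idea concrete, here is a minimal in-memory sketch of per-user throttling using a sliding time window. It is a stand-in only: the RateLimiter class and its parameters are invented for this example, and a dict in one process is not shared across servers, which is why Redis or a database is the better home for these counters in a real deployment.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most max_requests per user within window_seconds."""

    def __init__(self, max_requests=20, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so tests can fake time
        self._hits = defaultdict(deque)     # user_id -> timestamps of recent requests

    def allow(self, user_id):
        now = self.clock()
        hits = self._hits[user_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Call limiter.allow(user_id) at the top of your request handler and return an HTTP 429 when it comes back False.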

✓

Error handling & retries

Wrap API calls with exponential backoff. Handle 429 rate limit errors gracefully.
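A generic retry helper with exponential backoff and jitter might look like the sketch below. The name with_backoff and its defaults are assumptions for illustration; the helper is library-agnostic and simply retries whatever exception types you pass in.

```python
import random
import time

def with_backoff(fn, retryable=(Exception,), max_attempts=5,
                 base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so many clients do not all retry at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

With the Anthropic SDK you could pass retryable=(anthropic.RateLimitError,) and wrap the create call in a lambda. The SDK client also accepts a max_retries option that retries some failures on its own, so check what it already covers before layering your own retries on top.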

✓

Input validation

Limit message length, strip control characters, reject obviously malicious inputs.
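Here is a small sketch of the first two checks, length limiting and control-character stripping. The clean_message name and the 2000-character cap are assumptions (the cap matches the one used in the full example in section 6); detecting malicious input is application-specific and is not attempted here.

```python
import unicodedata

MAX_MESSAGE_CHARS = 2000

def clean_message(text, max_chars=MAX_MESSAGE_CHARS):
    """Strip control characters, trim whitespace, and cap the length."""
    if not isinstance(text, str):
        raise ValueError("message must be a string")
    # Remove Unicode control characters (category "Cc") but keep newlines and tabs.
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    cleaned = cleaned.strip()
    if not cleaned:
        raise ValueError("message is empty")
    return cleaned[:max_chars]
```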

✓

Logging & observability

Log every request with token counts and latency. You need this for debugging and cost attribution.
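One lightweight way to get those numbers is to wrap the API call in a helper that times it and emits a structured log line. In the sketch below, logged_call and the log field names are made up for illustration; it assumes the response exposes usage.input_tokens and usage.output_tokens, as the Anthropic SDK's message objects do.

```python
import json
import logging
import time

logger = logging.getLogger("chatbot")

def logged_call(create_fn, session_id, **kwargs):
    """Call create_fn(**kwargs), log token counts and latency, return both."""
    start = time.perf_counter()
    response = create_fn(**kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "session_id": session_id,
        "model": kwargs.get("model"),
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": round(latency_ms, 1),
    }
    # One JSON object per line is easy to ship to most log pipelines.
    logger.info("llm_request %s", json.dumps(record))
    return response, record
```

Usage would look like response, record = logged_call(client.messages.create, session_id, model=..., max_tokens=..., messages=history).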

✓

Persist conversation history

Store in a database (Postgres, DynamoDB). In-memory history dies on server restart.
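To show the shape of a persistent store without standing up Postgres or DynamoDB, the sketch below uses SQLite from the Python standard library as a stand-in. The HistoryStore class and its table layout are assumptions for this example; the same append/load interface could sit in front of whichever database you choose.

```python
import sqlite3

class HistoryStore:
    """Persist chat messages per session in a SQLite database."""

    def __init__(self, path=":memory:"):
        # Pass a file path (e.g. "chat.db") to survive restarts.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT NOT NULL, "
            "role TEXT NOT NULL, "
            "content TEXT NOT NULL)"
        )
        self.conn.commit()

    def append(self, session_id, role, content):
        with self.conn:   # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (session_id, role, content),
            )

    def load(self, session_id):
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (session_id,),
        ).fetchall()
        return [{"role": role, "content": content} for role, content in rows]
```

The list returned by load has the same shape as the history list from section 2, so it can be passed straight to the messages parameter.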

6. Full Chatbot Example

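A compact FastAPI chatbot with streaming, memory, and basic error handling. Sessions live in an in-memory dict here to keep the example short; for production, swap in a database as the checklist above recommends: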

chatbot_api.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic, json

app = FastAPI()
client = anthropic.AsyncAnthropic()   # async client, so streaming does not block the event loop
sessions = {}   # session_id -> message history (in-memory; lost on restart)

@app.post("/chat/{session_id}")
async def chat(session_id: str, body: dict):
    user_msg = body["message"][:2000]   # cap input length

    history = sessions.setdefault(session_id, [])
    history.append({"role": "user", "content": user_msg})

    async def generate():
        full_text = ""
        try:
            async with client.messages.stream(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system="You are a helpful AI assistant.",
                # Sliding window. History ends on a user turn here, so an odd
                # length keeps the window starting on a user turn as well.
                messages=history[-21:]
            ) as stream:
                async for text in stream.text_stream:
                    full_text += text
                    yield json.dumps({"delta": text}) + "\n"
        except anthropic.RateLimitError:
            history.pop()   # drop the unanswered user turn so roles keep alternating
            yield json.dumps({"error": "Rate limit hit, please wait."}) + "\n"
            return

        history.append({"role": "assistant", "content": full_text})

    return StreamingResponse(generate(), media_type="application/x-ndjson")
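On the client side, the endpoint's response is newline-delimited JSON: each line is either a {"delta": ...} chunk or an {"error": ...} object. A minimal sketch of how a caller might reassemble the reply is below; collect_reply is a hypothetical helper name, and it works on any iterable of text lines.

```python
import json

def collect_reply(lines):
    """Join the "delta" chunks from an NDJSON stream into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank keep-alive lines
        event = json.loads(line)
        if "error" in event:
            raise RuntimeError(event["error"])
        parts.append(event.get("delta", ""))
    return "".join(parts)
```

With an HTTP client that can iterate over response lines (for example httpx's response.iter_lines() inside a streaming request), you can feed those lines directly to collect_reply, or print each delta as it arrives for a live typing effect.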

7. Knowledge Check

5 questions

Q1. Why does a basic LLM API have no memory between calls?

Q2. What does the model return when it wants to call a tool?

Q3. What is the main benefit of streaming API responses?

Q4. Why is it important to treat tool call inputs as untrusted?

Q5. A sliding window context strategy means...

8. What's Next

Your chatbot is live. Next: build an image generation pipeline with Stable Diffusion.