Claude Agent SDK

A programming toolkit for building autonomous AI agents — no need to write your own tool execution loop.

The Claude Agent SDK (formerly Claude Code SDK) is a Python and TypeScript library that lets you programmatically run an AI agent that can read files, run commands, edit code, and fetch web data — all automatically, just like an engineer working in a terminal. This SDK is the same engine that powers Claude Code itself.

What's the core idea?

Imagine hiring an AI contractor. Instead of standing over their shoulder issuing instructions step by step, the Agent SDK lets you state a goal once; the AI then figures out the steps, does the work, and checks the result on its own. The SDK handles the entire loop for you.

Built-in Tools

Read files, run bash, edit code — no setup or custom tool execution needed.

Automatic Agent Loop

Claude iterates through think → act → verify until the goal is complete.

Context Management

Automatic context compaction keeps long tasks running without losing information.

Python & TypeScript

First-class support for both leading AI/data languages. CLI built-in, no extra installs.

Quick install

# Python
pip install claude-agent-sdk

# TypeScript / Node.js
npm install @anthropic-ai/claude-agent-sdk

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    async for message in query(
        prompt="Find and fix the bug in auth.py",
        options=ClaudeAgentOptions(allowed_tools=["Read", "Edit", "Bash"])
    ):
        print(message)  # Claude reads → finds bug → edits → done

asyncio.run(main())

SDK vs API Client

Anthropic offers two ways to use Claude. Understanding the difference helps you pick the right tool for your project.

❌ Raw API Client — you write the loop

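import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Fix the bug"}]
# MODEL, TOOLS, and your_tool_executor are placeholders you define yourself
response = client.messages.create(model=MODEL, max_tokens=1024, tools=TOOLS, messages=messages)

# You must implement this loop yourself
while response.stop_reason == "tool_use":
    messages.append({"role": "assistant", "content": response.content})
    # your_tool_executor runs the requested tools and returns tool_result blocks
    messages.append({"role": "user", "content": your_tool_executor(response)})
    response = client.messages.create(model=MODEL, max_tokens=1024, tools=TOOLS, messages=messages)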

✅ Agent SDK — Claude handles it

async for msg in query(
    prompt="Fix the bug"
):
    print(msg)

# Claude: read → edit → test → done

Criteria	Raw API Client	Agent SDK
Setup complexity	Low — plain API call	Medium — but does far more
Tool execution	You implement yourself	Claude handles it all
Best for	Chat apps, simple Q&A	Automation, CI/CD, data pipelines
When to use	Need per-step control	Need agent to work autonomously

💡

Pro tip: Many teams use both — the raw API client for interactive chat UIs, and the Agent SDK for background automation pipelines running unattended.

query() — Communicating with the Agent

query() is the central function of the SDK. Send it a task in natural language and receive a stream of messages as the agent works.

Basic syntax

from claude_agent_sdk import query, ClaudeAgentOptions, AssistantMessage, TextBlock

async for message in query(
    prompt="Analyze sales.csv and generate a summary report",
    options=ClaudeAgentOptions(
        system_prompt="You are an expert data analyst.",
        max_turns=10,              # max 10 action turns
        allowed_tools=["Read", "Bash"],
    )
):
    if isinstance(message, AssistantMessage):
        for block in message.content:
            if isinstance(block, TextBlock):
                print(block.text)

Key parameters

Parameter	Meaning	Example
`prompt`	Task to send to the agent	"Fix bug in auth.py"
`system_prompt`	Define the agent's role	"You are a senior Python backend engineer"
`max_turns`	Limit on action turns	10, 20, 50
`allowed_tools`	Tools the agent may use	["Read", "Edit", "Bash"]
`disallowed_tools`	Block specific tools	["Bash"] — prevent shell commands

Message types returned

Type	Meaning
`AssistantMessage`	Claude's reply; its `content` is a list of blocks such as `TextBlock` (text) and `ToolUseBlock` (a tool call: reading a file, running a command…)
`UserMessage`	Input fed back to the model, including tool results
`ResultMessage`	Final result after the agent finishes
`SystemMessage`	System notification (context compacted, error…)

Built-in Tools

The SDK ships with tools Claude can use immediately — no custom code needed. These are the agent's "hands".

Read — Read files

Read any file in the project: code, CSV, JSON, logs, config. Claude understands the content and uses it as context.

Edit / Write — Modify & Create

Edit code files with surgical patches, or create new files from scratch. Supports undo through checkpointing.

Bash — Run shell commands

Execute any shell command: run tests, install packages, call APIs, process data with pandas, and more.

Glob / Grep — Search

Find files by pattern, search text across the codebase. Helps the agent orient quickly in large projects.

WebFetch — Get web data

Access URLs, read online docs, fetch APIs — the agent can find documentation or pull real-world data.

TodoWrite — Task tracking

The agent creates and tracks its own to-do list to stay on course through long multi-step projects.

⚠️

Safety note: By default Claude will ask for confirmation before editing files or running commands. You can customize this via permission_mode to fully automate once you trust the environment.

Custom Tools & MCP

Beyond built-in tools, you can give the agent your own custom tools or connect it to external services via MCP.

Custom Tool in Python

Define an async Python function, attach the @tool decorator, and register it with the agent. Claude will know when to call it.

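from claude_agent_sdk import (
    tool, create_sdk_mcp_server, ClaudeSDKClient, ClaudeAgentOptions
)

# Define your own tool (your_db is a placeholder for your database handle)
@tool("query_database", "Query the internal database", {"sql": str})
async def run_sql(args):
    result = your_db.execute(args["sql"])
    return {"content": [{"type": "text", "text": str(result)}]}

# Wrap as an in-process MCP server (no separate process needed)
server = create_sdk_mcp_server(
    name="data-tools", version="1.0", tools=[run_sql]
)

# Pass it to the agent: mcp_servers maps a server name to its config,
# and the tool is exposed to Claude as mcp__<server>__<tool>
options = ClaudeAgentOptions(
    mcp_servers={"data-tools": server},
    allowed_tools=["mcp__data-tools__query_database"],
)
async with ClaudeSDKClient(options=options) as client:
    await client.query("Analyze Q4 revenue from the database")
    async for message in client.receive_response():
        print(message)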

Connecting an External MCP Server

MCP (Model Context Protocol) is Anthropic's open tool-connection standard. You can connect Slack, GitHub, Notion, and external databases with just a few lines of config.

options = ClaudeAgentOptions(
    # mcp_servers maps each server name to its connection config
    mcp_servers={
        "github": {"type": "sse", "url": "https://mcp.github.com/sse"},
        "postgres": {"type": "http", "url": "http://localhost:5432/mcp"},
    },
    strict_mcp_config=True  # only use declared servers, ignore global config
)

Sessions & Context Management

Agents can work for hours continuously. Sessions help you manage memory, resume work after interruption, and avoid losing context when conversations grow long.

The problem: Context Window has limits

ℹ️

Claude can only "remember" a certain amount of text in a single working session (the context window). For large projects requiring many hours, the SDK automatically compacts context to retain important information without exceeding the limit.

Resume a session after stopping

from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage

# Run 1: start the task and save the session ID from the final ResultMessage
session_id = None
async for message in query(prompt="Start refactoring the authentication module"):
    if isinstance(message, ResultMessage):
        session_id = message.session_id  # save this

# Run 2: continue from exactly where it left off
async for message in query(
    prompt="Continue: write unit tests for what you just refactored",
    options=ClaudeAgentOptions(resume=session_id),
):
    print(message)

Eager session flushing for live UIs

Use session_store_flush="eager" if you need to display output in real-time to a terminal or dashboard — data is pushed immediately instead of waiting until the end of a turn.

Hooks — Controlling Agent Behavior

Hooks let you intercept the agent between steps to inspect, block, or modify its actions, like a security layer you control.

Pre-tool Hook (before tool use)

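from claude_agent_sdk import ClaudeAgentOptions, HookMatcher

def decide(decision, reason=""):
    return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
            "permissionDecision": decision, "permissionDecisionReason": reason}}

async def safety_check(input_data, tool_use_id, context):
    tool_name, tool_input = input_data["tool_name"], input_data["tool_input"]
    # Block the agent from deleting files (tool_input is a dict of tool arguments)
    if tool_name == "Bash" and "rm -rf" in tool_input.get("command", ""):
        return decide("deny", "File deletion not permitted")
    # Auto-approve safe commands
    if tool_name in ["Read", "Glob"]:
        return decide("allow")
    return {}  # no opinion: fall back to the normal permission flow

options = ClaudeAgentOptions(hooks={"PreToolUse": [HookMatcher(hooks=[safety_check])]})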

Post-tool Hook (after tool use)

Use it to log actions, sanitize output before Claude sees it, or write an audit trail for compliance.

Pre-tool Hook

Block or approve before the agent acts. Use for security guardrails, audit logging, and sandbox enforcement.

Post-tool Hook

Intercept after the tool runs. Modify the output before Claude sees the result, or record it for compliance.

Streaming Output

The agent returns results as a stream — you receive each message the moment Claude finishes it, without waiting for the full task to complete.

Why streaming matters

For complex tasks (refactoring an entire codebase, analyzing a large dataset), the agent may run for several minutes. Streaming lets you display actual progress in a terminal or dashboard rather than staring at a blank screen.

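from claude_agent_sdk import query, AssistantMessage, ResultMessage, TextBlock, ToolUseBlock

async for message in query(prompt="Run the full test suite and fix failures"):
    match message:
        case AssistantMessage():
            # Text and tool calls arrive as content blocks inside AssistantMessage
            for block in message.content:
                if isinstance(block, TextBlock):
                    print(f"🤖 Claude: {block.text}")
                elif isinstance(block, ToolUseBlock):
                    print(f"🔧 Using tool: {block.name}")
        case ResultMessage(is_error=True):
            print(f"❌ Error: {message.result}")
        case ResultMessage():
            print(f"✅ Done after {message.num_turns} turns")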

Structured Output

If you need JSON output (to process further in code), prompt Claude to return JSON and parse it:

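import json

# get_structured is a placeholder for your own helper that runs query()
# and returns the agent's final text
result = await get_structured(
    prompt="Analyze the file. Return only JSON: {errors: [], warnings: [], summary: ''}"
)
data = json.loads(result)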
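
Model replies often wrap the JSON in a sentence or two, in which case a bare json.loads fails. The sketch below shows one forgiving way to parse such a reply. extract_json is a hypothetical helper, not part of the SDK, and it assumes the reply contains at least one complete JSON object.

```python
import json


def extract_json(reply: str) -> dict:
    """Return the first JSON object found in a model reply.

    Hypothetical helper (not part of the SDK). It scans from each "{"
    and lets the JSON decoder decide where the object ends, so prose
    before or after the object is ignored.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this position; try the next brace
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


reply = (
    'Here is the analysis: {"errors": [], "warnings": ["unused import"], '
    '"summary": "ok"} Let me know if you need more detail.'
)
data = extract_json(reply)
print(data["warnings"])
```

Raising ValueError when nothing parses keeps the failure explicit, so the caller can decide whether to re-prompt the agent or escalate.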

Use Case: AI Coding Agent

Apply the Claude Agent SDK to automate the full software development lifecycle — from code review to deployment.

AI CODING Auto Bug Fixer Agent

Scenario: CI/CD detects a test failure on GitHub. Instead of an engineer debugging manually, the agent reads the stack trace, finds the root cause, fixes the code, re-runs tests, and opens a PR automatically.

Workflow

Receive CI trigger

A GitHub Action calls the SDK when tests fail, passing in the error log and commit ID.

Agent reads the codebase

Uses Read + Grep to understand the context and find files related to the error.

Fix the code

Uses Edit to apply a surgical patch. Nothing is deleted; only the failing line is changed.

Re-run tests

Uses Bash to run the test suite. If still failing, loops back to step 2 (up to max_turns).

Create PR

Uses Bash + GitHub CLI to commit and open a Pull Request with a detailed description of the bug and fix.

async def auto_fix_bug(error_log: str, repo_path: str):
    async for msg in query(
        prompt=f"""
        Test suite failed with the following error:
        {error_log}

        Tasks:
        1. Read the stack trace and find the root cause
        2. Fix the code (only the broken part)
        3. Re-run: pytest tests/
        4. If passing: git add -A && git commit -m 'fix: ...'
        5. gh pr create --title 'Auto fix' --body 'Fixed by Claude Agent'
        """,
        options=ClaudeAgentOptions(
            cwd=repo_path,  # run the agent inside the repository
            allowed_tools=["Read", "Edit", "Bash", "Grep"],
            max_turns=20,
            system_prompt="You are a senior Python engineer. Fix bugs precisely, no unnecessary refactoring."
        )
    ):
        print(msg)

✅

Use hooks for safety: Add a pre-tool hook to block dangerous commands like git push --force or rm -rf before the agent accidentally runs them.
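
The check a pre-tool hook runs can be kept as a plain function so it is easy to unit test. The sketch below is one possible deny-list. is_dangerous and DANGEROUS_PATTERNS are hypothetical names, and pattern matching on shell strings is only a heuristic (for example, it does not catch rm -r -f written as two separate flags), so treat it as a first line of defense rather than a sandbox.

```python
import re

# Illustrative deny-list (an assumption, not an SDK feature); extend it for your environment.
DANGEROUS_PATTERNS = [
    # rm with both -r and -f in one flag group, e.g. "rm -rf" or "rm -fr"
    re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b"),
    # git push with --force (including --force-with-lease) or -f
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
]


def is_dangerous(command: str) -> bool:
    """Return True if the shell command matches any deny-list pattern."""
    return any(pattern.search(command) for pattern in DANGEROUS_PATTERNS)


for cmd in ["pytest tests/", "git push origin main", "git push --force origin main", "rm -rf build/"]:
    print(f"{cmd!r}: {'BLOCK' if is_dangerous(cmd) else 'allow'}")
```

A hook would call is_dangerous on the Bash command string and return a deny decision when it is True.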

SDK features used

Feature	Role in this project
`query()` + `allowed_tools`	Restrict the agent to the tools the task needs (Read, Edit, Bash, Grep), nothing out of scope
`max_turns`	Prevent infinite loops when a bug is complex
`Hooks`	Audit log every file change for later review
`Sessions`	Continue a large refactor across multiple runs
`ResultMessage`	Know definitively whether the agent succeeded or timed out

Use Case: Data Analytics Pipeline

Use the Agent SDK to build an automated large-scale data pipeline — from cleaning to visualization — without a data engineer running each step manually.

DATA ANALYTICS Automated Data Pipeline Agent

Scenario: Every morning, the system receives a fresh CSV from the data warehouse (~500k rows). The agent automatically checks data quality, cleans it, calculates KPIs, generates a PDF report with charts, and posts to Slack.

Workflow

Read & explore data

Uses Read + Bash to run pandas.describe(), understand structure, detect nulls and outliers.

Data cleaning

Agent writes a Python script to handle missing values, normalize formats, and deduplicate. Runs and verifies the result.

Calculate KPIs

Runs aggregation queries: revenue, CAC, churn rate, top products by region. Outputs JSON.

Create visualizations

Writes and executes matplotlib/plotly scripts to generate charts. Saves PNG/HTML.

Send report

Uses a Custom Tool to call the Slack API with the summary and chart attachments for the team.

async def morning_pipeline(csv_path: str):
    async for msg in query(
        prompt=f"""
        New data file: {csv_path}

        Pipeline:
        1. Read and summarize (rows, columns, null %)
        2. Clean: handle nulls per domain rules, normalize date format
        3. Calculate KPIs: total revenue, by region, top 10 products
        4. Generate bar chart revenue by region, save to /reports/today.png
        5. Call send_slack #data-team with summary and chart path
        """,
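        options=ClaudeAgentOptions(
            # send_slack is assumed to be a custom tool served by slack_server
            allowed_tools=["Read", "Write", "Bash", "mcp__slack__send_slack"],
            max_turns=30,
            mcp_servers={"slack": slack_server},
        )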
    ):
        print(msg)

Use Case: Legal Document Review

Automate contract review — find risk clauses, flag unfavorable terms, compare against standard templates. Reduces junior lawyer time by 80%.

LEGAL TECH Contract Risk Analyzer

Scenario: A company receives 50–200 contracts per month from partners. Previously, each took a lawyer 2–4 hours to review. The agent reads the PDF, cross-references the internal legal playbook, tags each clause by risk level, and generates a memo for senior lawyer review.

Workflow

Read & parse contract

Uses Read to read the PDF. Extracts sections: Parties, Term, Payment, Liability, IP Ownership, Termination, Governing Law.

Compare against standard playbook

Loads legal playbook via Memory/RAG. Agent compares each clause with the standard template, tagging: ACCEPTABLE / NEGOTIABLE / RED_FLAG.

Financial risk analysis

Calculates maximum exposure: penalty clauses, liability cap, indemnification scope. Outputs absolute dollar amounts.

Generate legal memo

Writes a concise memo for the senior lawyer: executive summary, top 3 risks, recommended redlines, overall risk score.

Log to Case Management

Calls a Custom Tool to push results to the case management system (Clio / Notion / Jira Legal).

✅

Real-world ROI: At 100 contracts/month × 3h × $150/h junior lawyer = $45,000/month saved. Agent cost estimate: ~$200–400/month.
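
The ROI figure above can be reproduced with a few lines of arithmetic. This sketch only restates the numbers quoted in this section. The variable names are illustrative, and the 80% line applies the time reduction claimed in the section introduction instead of assuming all review time disappears.

```python
contracts_per_month = 100
hours_per_contract = 3            # midpoint of the 2-4 hour range in the scenario
junior_rate_usd = 150             # hourly rate quoted above
agent_cost_usd = (200, 400)       # estimated monthly agent cost range quoted above

# Cost of reviewing every contract manually
review_cost = contracts_per_month * hours_per_contract * junior_rate_usd

# Net saving if the agent replaced all of that time
net_low = review_cost - agent_cost_usd[1]
net_high = review_cost - agent_cost_usd[0]

# Saving if the agent removes 80% of the review time instead of all of it
saving_at_80_percent = review_cost * 80 // 100

print(f"Manual review cost: ${review_cost:,}/month")
print(f"Net of agent cost:  ${net_low:,} to ${net_high:,}/month")
print(f"At 80% reduction:   ${saving_at_80_percent:,}/month")
```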

Use Case: Customer Support Triage

Automatically classify, prioritize, and resolve support tickets — the agent handles 70% of tier-1 tickets without human intervention, escalating complex cases with full context.

CUSTOMER SUCCESS Intelligent Ticket Resolver

Scenario: A B2B SaaS company receives 500–2000 tickets/day via email/Intercom. 70% are repetitive questions (billing, account, basic how-to). The agent reads each ticket, looks up the knowledge base and account history, then replies or escalates with full context for the human agent.

3-tier processing architecture

Classify & route (Haiku — fast, cheap)

Receive ticket → classify intent + urgency + sentiment (1–5) + tier: TIER1_AUTO / TIER2_HUMAN / TIER3_URGENT. Cost: ~$0.001/ticket.

Context enrichment (Sonnet)

For TIER1: pull account data via Custom Tool, find related tickets (vector search), load relevant KB articles. Compose a full reply.

Resolve or escalate

TIER1: send the reply directly and close the ticket. TIER2/3: escalate with a summary of what the agent found and its recommended action, so the human agent is up to speed in 30 seconds.

✅

Real numbers: 1000 tickets/day, TIER1 ~70% = 700 tickets. Cost: Haiku classify 1000 + Sonnet resolve 700 ≈ $8–12/day vs. 5 human agents × $30/day = $150.
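
A quick way to sanity-check these numbers is to back out the per-ticket cost they imply. The sketch below uses only the figures quoted in this section. The implied Sonnet cost per resolved ticket is derived here, not stated in the source, and it assumes the whole remainder of the daily total goes to Sonnet.

```python
tickets_per_day = 1000
tier1_share = 0.70
haiku_cost_per_ticket = 0.001        # classification cost quoted in the first tier
total_cost_range = (8.0, 12.0)       # daily total quoted above
human_cost_per_day = 5 * 30          # 5 human agents x $30/day

tier1_tickets = round(tickets_per_day * tier1_share)
classify_cost = tickets_per_day * haiku_cost_per_ticket

# Whatever is left after classification is attributed to Sonnet (an assumption)
implied_sonnet_cost = [
    (total - classify_cost) / tier1_tickets for total in total_cost_range
]

print(f"TIER1 tickets/day: {tier1_tickets}")
print(f"Classification cost/day: ${classify_cost:.2f}")
print(f"Implied Sonnet cost/ticket: ${implied_sonnet_cost[0]:.4f} to ${implied_sonnet_cost[1]:.4f}")
print(f"Human baseline/day: ${human_cost_per_day}")
```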

Use Case: Security Threat Detection

Agent continuously reads logs, correlates events, detects attack patterns, generates incident reports, and triggers automated response — an intelligent SIEM without rule-based limitations.

CYBERSECURITY Autonomous Threat Hunter

Scenario: A 3-person security team manages a 200-server infrastructure and does not have enough people to read logs 24/7. The agent runs every 15 minutes, reads aggregated logs, finds anomalies, correlates them with threat intelligence, automatically blocks suspicious IPs, and generates an incident report for the on-call engineer.

Workflow

Ingest & normalize logs

Reads log batches from S3/Elasticsearch. Runs a bash normalization pipeline to standardize format from multiple sources (nginx, auth, firewall, app).

Anomaly detection

Uses Bash to run statistical analysis: per-IP request rate, failed-auth spikes, unusual port activity, and impossible travel (geo-velocity). Flags outliers with Z-score > 3.

Threat intelligence correlation

WebFetch checks flagged IPs against VirusTotal / AbuseIPDB. Cross-references with internal blocklist via Custom Tool.

Automated response

If confidence is high: calls Custom Tool to block IP on the firewall automatically. If uncertain: escalates to human with evidence packet.

Incident report

Generates report in MITRE ATT&CK framework: Tactic / Technique / Evidence / Impact / Remediation. Sends to PagerDuty + Slack #security.
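
The anomaly detection step in the workflow above (flag outliers with Z-score > 3) can be sketched with the standard library. flag_outliers and the sample traffic are illustrative assumptions, not part of the SDK. The sketch is one-sided (it only flags unusually high counts) and uses the population standard deviation. With that choice a single outlier among n values can never score above sqrt(n - 1), so the threshold of 3 only becomes reachable with more than 10 samples.

```python
from statistics import mean, pstdev


def flag_outliers(requests_per_ip: dict[str, int], threshold: float = 3.0) -> list[str]:
    """Return the IPs whose request count is more than `threshold`
    standard deviations above the mean."""
    counts = list(requests_per_ip.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []  # all counts identical, nothing stands out
    return [ip for ip, n in requests_per_ip.items() if (n - mu) / sigma > threshold]


# Hypothetical sample: 20 ordinary clients and one very noisy address
baseline = {f"10.0.0.{i}": 90 + i for i in range(20)}
traffic = {**baseline, "203.0.113.9": 900}
print(flag_outliers(traffic))
```

Running the statistics in plain code and handing the agent only the flagged IPs keeps the decision reproducible and the context small.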

⚠️

Always audit with Hooks: Every block_ip call must be logged with timestamp + reasoning. Pre-tool hook should require a minimum confidence score before executing. Human oversight is critical here.

Use Case: Content Production Factory

Multi-channel marketing content pipeline — from a single brief, the agent produces a blog post, 10 social posts, an email newsletter, and SEO metadata, all consistent in voice and brand.

MARKETING Multi-Channel Content Generator

Scenario: A SaaS startup needs 3–4 content pieces/week across blog, LinkedIn, Twitter/X, and email newsletter. One content writer can't keep up. Agent receives a brief from the product team, researches, creates multi-channel drafts, optimizes for SEO, saves to the CMS — the writer only reviews and approves.

Workflow

Research phase

WebFetch reads the top 5 articles on the topic. Grep through the internal blog archive to avoid duplicates. Builds an outline based on gap analysis.

Long-form blog (Opus/Sonnet)

Creates a 1500–2500 word post with standard SEO structure: H1/H2/H3, internal link suggestions, meta description, target keyword density.

Social content (Haiku — fast)

Extracts the 10 best quotes/insights from the blog. Reformats for LinkedIn (professional), Twitter/X (punchy, hooks), and Threads (conversational).

Email newsletter

Summarizes blog into a 300-word newsletter with CTA, 3 A/B subject line variants, and preview text.

Publish to CMS

Custom Tool pushes draft to Contentful/WordPress with correct tags, categories, and scheduled publish date. Notifies the content writer via Slack.

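import asyncio

# create_blog, create_social_posts, create_newsletter, and optimize_seo are
# placeholders for your own async helpers (for example, thin wrappers around query())
async def content_factory(brief: dict):
    # Orchestrator creates the blog first
    blog_content = await create_blog(brief)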

    # 3 subagents run in parallel from the blog content
    social, email, seo = await asyncio.gather(
        create_social_posts(blog_content, platforms=["linkedin", "twitter", "threads"]),
        create_newsletter(blog_content),
        optimize_seo(blog_content, target_keyword=brief["keyword"]),
    )
    return {"blog": blog_content, "social": social, "email": email, "seo": seo}

Use Case: Employee Onboarding Agent

New employees get a personal AI agent — it reads the entire wiki, policy docs, and codebase, and answers any onboarding question 24/7 with full context about the company and their specific role.

HR TECH Personal Onboarding Companion

Scenario: A 200-person tech company hires 5–10 people per month. Each new hire takes 2–4 weeks to become productive. HR spends 40% of their time answering the same questions repeatedly. The agent is personalized by role (engineer/designer/sales), has the company's full knowledge base, and learns from each interaction.

Personalized by role

Engineer onboarding focuses on codebase, dev setup, architecture. Designer: Figma, design system, brand. Sales: CRM, playbook, competitor intel.

Dynamic knowledge base

Indexes the full Notion wiki + Confluence + GitHub READMEs + policy docs. Vector search on employee questions. Self-updates when KB changes.

Per-employee memory

Remembers what has been asked and explained. No repeats. Tracks each person's onboarding checklist completion.

Proactive check-ins

End of days 1, 3, 7, 14, 30: automatically asks if there are blockers. Summarizes gaps for the HR manager to review.
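
The proactive check-in schedule above is easy to compute up front. This is a minimal sketch under two assumptions that the source does not state: day 1 is the start date itself, and the schedule counts calendar days, not working days. check_in_dates is a hypothetical helper name.

```python
from datetime import date, timedelta

CHECK_IN_DAYS = (1, 3, 7, 14, 30)  # schedule listed in the card above


def check_in_dates(start: date) -> list[date]:
    """Return the calendar date of each check-in, counting the start date as day 1."""
    return [start + timedelta(days=day - 1) for day in CHECK_IN_DAYS]


for day, when in zip(CHECK_IN_DAYS, check_in_dates(date(2025, 1, 6))):
    print(f"Day {day:>2}: {when.isoformat()}")
```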

✅

Dual ROI: Saves HR time (answering repetitive questions) and shortens time-to-productive from 4 weeks to 2 weeks. At $5k/employee onboarding cost, 10 hires/month is $50k/month; halving the ramp-up time puts the saving at roughly $25k–30k/month.

Architecture Diagrams — How It All Connects

Understand how the Agent SDK, Claude model, tools, and your client interact with each other.

Agent Loop — Overall Diagram

Server Stdout → Client Flow

CLI / Stdout Demo — Real World

See how the agent operates in the terminal — from the invocation to each streamed message.

Method 1: Use the CLI directly (no code)

bash — claude CLI

$ claude -p "Read data.csv and tell me how many null rows there are"

⚙ [tool_use] Read(path="data.csv")

⚙ [tool_use] Bash("python3 -c \"import pandas as pd; df=pd.read_csv('data.csv'); print(df.isnull().sum())\"")

→ customer_id 0

→ revenue 143

→ region 27

✔ [assistant] File data.csv has 170 null values total: 'revenue' missing 143 rows, 'region' missing 27 rows. Would you like me to handle them?

Method 2: Python SDK with stream output

python run_agent.py

$ python run_agent.py

[SystemMessage] Agent starting, context window: 200k tokens

[ToolUse] Read → auth.py

✓ Read 247 lines

[ToolUse] Bash → pytest tests/test_auth.py -v

✗ FAILED test_login_with_expired_token - AssertionError

[Assistant] Found bug: verify_token() doesn't check expiry timestamp. Will fix line 142.

[ToolUse] Edit → auth.py (lines 140-145)

✓ Patch applied

[ToolUse] Bash → pytest tests/test_auth.py -v

✓ 8 passed in 1.23s

[ResultMessage] ✅ Done in 5 turns | Cost: ~$0.04 | 1 file patched

Method 3: Subprocess — call the agent from any language

ℹ️

Because the agent communicates via JSON lines on stdout, you can call it from Go, Ruby, Java... by spawning a subprocess and reading stdout line by line.

# Run agent non-interactively, output JSON to stdout
claude -p "Check errors in src/ and list affected files" \
  --output-format stream-json --verbose \
  | jq '.message'  # parse with jq
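
The same pattern works from any language that can spawn a process. The Python sketch below reads JSON lines from a child process's stdout. To keep it runnable without the CLI installed, it uses a small stand-in child process, and the event shapes it prints (type, text, num_turns) are made up for illustration, not the CLI's actual schema. iter_json_lines is a hypothetical helper name.

```python
import json
import subprocess
import sys
from typing import Iterator


def iter_json_lines(command: list[str]) -> Iterator[dict]:
    """Spawn a process and yield each JSON object it prints, one per line."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue  # ignore blank lines between messages
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore anything that is not a JSON line
    if proc.returncode != 0:
        raise RuntimeError(f"agent exited with status {proc.returncode}")


# Stand-in for the real agent process (assumption: swap in your own command)
FAKE_AGENT = [
    sys.executable, "-c",
    "import json\n"
    "print(json.dumps({'type': 'assistant', 'text': 'Reading src/'}))\n"
    "print()\n"
    "print(json.dumps({'type': 'result', 'num_turns': 2}))\n",
]

for event in iter_json_lines(FAKE_AGENT):
    print(event["type"], event)
```

Reading line by line means the caller sees each message as soon as the child flushes it, instead of waiting for the process to exit.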


LLM Pricing Calculator

Calculate the real cost of your Agent SDK pipeline. Adjust task volume, model mix, and token usage to estimate monthly cost.

[Interactive calculator: not available in this text version. Its default inputs are listed below.]

📋 Pipeline parameters

Tasks / day: 500
Active days / month

🤖 Model Mix (%)

Haiku 4.5 (classify/filter): 40%
Sonnet 4.6 (analyze/write): 50%
Opus 4.6 (deep reasoning): 10%

🔢 Token usage / task

Input tokens (avg): 8,000
Output tokens (avg): 2,000

⚡ Optimizations

Prompt Caching (cache ~40% input tokens)
Batch API (50% discount, latency +minutes)
Tool Search (reduce 90% skill context tokens)

Outputs

Estimated cost per month, per day, and per task; a breakdown by model; savings from each optimization; and a comparison of your pipeline against a load-all pipeline (no optimizations), an all-Opus pipeline, and a human team (~5 FTE, $30,000).
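
The core of such a calculator is a weighted sum over the model mix. The sketch below is an approximation under stated assumptions: the per-million-token prices are placeholders chosen for illustration and are not real list prices, 30 active days per month is assumed, and prompt caching and tool search savings are left out. monthly_cost, ModelPrice, and the tier names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_mtok: float    # USD per million input tokens
    output_per_mtok: float   # USD per million output tokens


def monthly_cost(tasks_per_day, active_days, mix, prices,
                 input_tokens, output_tokens, batch_discount=0.0):
    """Estimate monthly spend for a pipeline that splits tasks across models.

    mix maps a model tier to its share of tasks (shares must sum to 1).
    batch_discount is the fraction taken off when a batch discount applies.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("model mix must sum to 1.0")
    per_task = sum(
        share * (input_tokens * prices[tier].input_per_mtok
                 + output_tokens * prices[tier].output_per_mtok) / 1_000_000
        for tier, share in mix.items()
    )
    return tasks_per_day * active_days * per_task * (1.0 - batch_discount)


# Placeholder prices for illustration only; look up current pricing before relying on them
PLACEHOLDER_PRICES = {
    "small": ModelPrice(1.0, 5.0),
    "medium": ModelPrice(3.0, 15.0),
    "large": ModelPrice(5.0, 25.0),
}
MIX = {"small": 0.40, "medium": 0.50, "large": 0.10}

cost = monthly_cost(500, 30, MIX, PLACEHOLDER_PRICES, 8_000, 2_000)
print(f"Estimated: ${cost:,.2f}/month, ${cost / 30:,.2f}/day, ${cost / (500 * 30):.4f}/task")
```

Keeping prices as an input makes the estimate easy to rerun when pricing or the model mix changes.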

Minor Features

Features rarely needed in basic projects but valuable in production.

Observability & Monitoring

OpenTelemetry: Integrate trace/span to track agents in monitoring systems (Datadog, Jaeger).
Cost Tracking: View per-request token cost via ResultMessage.usage.
Todo Tracking: Agent auto-creates a todo list to stay on track in long tasks (via TodoWrite tool).

File & State Management

File Checkpointing: Automatically snapshots files before Edit. Can rewind (undo) if agent breaks something.
Session Store: Saves conversation transcript for resume after crash or step-by-step debugging.

Advanced Customization

Slash Commands: Define shortcuts like /reset, /status to control the agent mid-session.
Skills: Package sets of instructions/tools into a reusable "skill" across multiple projects.
Plugins: Extension point for platform-level customization.
Subagents: Main agent spawns child agents that run in parallel (parallelism).
Tool Search: When you have hundreds of tools, the agent auto-finds the right ones instead of loading all of them.
Effort Level: Adjust effort from low to xhigh (for Opus models).
Strict MCP Config: Lock the exact list of MCP servers the agent can use, preventing unintended connections.

Multi-cloud Support

Amazon Bedrock: Use CLAUDE_CODE_USE_BEDROCK=1 to route through AWS instead of Anthropic directly.
Google Vertex AI: Use CLAUDE_CODE_USE_VERTEX=1 for Google Cloud.
Microsoft Azure AI Foundry: Use CLAUDE_CODE_USE_FOUNDRY=1 for Azure.

Deploy & Billing

What you need to know when taking Agent SDK to production.

Billing Notice (from 15/6/2026)

⚠️

Important change: From 15/6/2026, Agent SDK and claude -p will use a separate credit pool instead of sharing the subscription pool. Programmatic usage is no longer subsidized as before.

Usage type	Pool	Notes
Interactive chat (claude.ai)	Subscription pool	Unchanged from before
Agent SDK / `claude -p`	Agent SDK credit pool	Fixed monthly credit, then pay API rate
GitHub Actions	Agent SDK credit pool	Same as above

Production best practices

Use Hooks to sandbox

Always add a pre-tool hook blocking dangerous commands when deploying on production servers.

Set reasonable max_turns

Avoid infinite loops that burn money. For simple tasks use 5-10, complex ones use 20-30.

Monitor cost with ResultMessage

Log message.usage from ResultMessage to track cost per task.

Use strict_mcp_config=True

When deploying, lock the MCP server list. Prevents agents from connecting to tools outside the whitelist.

ℹ️

Managed Agents: If you don't want to manage infrastructure yourself, Anthropic also offers Managed Agents — a hosted REST API where Anthropic handles all sandbox and execution. Suitable for rapid scaling.

Optimal Architecture — Multi-Skill, Multi-Document

When a project has dozens of Skills and thousands of documents, you can't load everything into context at once. This architecture describes how to intelligently layer things so the agent works accurately with minimal token cost.

⚠️

Core problem: 500 skills + 10,000 documents = millions of tokens if loaded in full. The 200k token context window fills immediately. Solution: only load what's needed, when needed — via embedding search + tool search.

4-Layer Architecture Map

Core Architecture Principles

Load only what's needed — Tool Search

Instead of injecting 500 skills into context, the agent uses tool_search to find top-k via embedding. Saves 90% context vs load-all.

Parallel subagents — Isolated Context

Each subagent has its own context. Runs in parallel, sends only the core result back to the orchestrator.

Model routing saves cost

Haiku ($) for classify, Sonnet ($$) for analyze, Opus ($$$) only when complex reasoning is truly needed.

Memory persists — no re-learning

Results saved to graph + vector store. The next session loads exactly the right context without re-explaining.
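
The model routing principle above amounts to a lookup from task kind to model tier. A minimal sketch follows. The task kinds, the tier labels (which are not API model IDs), and the choice to send unknown kinds to the middle tier are all assumptions made for illustration.

```python
# Task kind -> model tier, following the classify / analyze / complex reasoning split above
ROUTES = {
    "classify": "haiku",
    "analyze": "sonnet",
    "deep_reasoning": "opus",
}


def route_model(task_kind: str) -> str:
    """Pick the cheapest tier listed for the task kind.

    Unknown kinds fall back to the middle tier (a design choice for this
    sketch) so they are neither starved of capability nor sent straight
    to the most expensive model.
    """
    return ROUTES.get(task_kind, "sonnet")


for kind in ["classify", "analyze", "deep_reasoning", "summarize"]:
    print(f"{kind} -> {route_model(kind)}")
```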

Scan · Embed · Store — How Agents Read Large Data

When you have thousands of documents and skills, you need a data preparation pipeline before the agent runs. Here's the 3-step process.

1. Scan — Crawl and Chunk

Agentic Scan (preferred)

Use Read + Grep + Glob directly. Higher accuracy than semantic, easier to debug. Use for exact match or codebase search.

Semantic Scan (supplemental)

Chunk text → embed → vector search. Faster at large scale. Use for 10k+ documents.

ℹ️

Anthropic recommends: Start with agentic search. Only add semantic search when you need speed or scale is too large to Read files one by one.

2. Embed — Convert to Vectors

import voyageai
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

# 1. Chunk documents (500 tokens/chunk, overlap 50)
# chunk_documents and docs are placeholders for your own chunker and corpus
chunks = chunk_documents(docs, size=500, overlap=50)

# 2. Embed with Voyage AI (Anthropic recommended)
doc_embeddings = vo.embed(
    [c.text for c in chunks],
    model="voyage-4", input_type="document"
).embeddings

# 3. Store in Qdrant (the "project_docs" collection must already exist)
qdrant.upload_points(
    collection_name="project_docs",
    points=[PointStruct(id=i, vector=emb,
                        payload={"text": chunks[i].text, "source": chunks[i].source})
            for i, emb in enumerate(doc_embeddings)]
)
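
The chunking step in the snippet above relies on a helper that is not shown. The sketch below is one way to write an overlapping chunker. It is an approximation: it splits on whitespace, so "size" counts words, not model tokens, and chunk_text is a hypothetical name that handles a single string, not a document list.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `size` words, where each chunk
    repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_text(sample, size=5, overlap=2):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some words twice.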

3. Memory Store Layers

Memory Type	Used For	Technology	Token Cost
Short-term	Current context window	SDK context + compaction	High — pay per token
Vector store	Document chunks, skill embeddings	Qdrant, pgvector	Low — query-time only
Graph store	Entities, relations, cross-doc links	cognee, Neo4j	Very low — traversal
File memory	MEMORY.md, progress notes	File system	~0 — plain text read
Session store	Transcript, resume state	SQLite	0 — offline

Tool Search for Skills — 90% context savings

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # local, free

# Offline: embed all skill descriptions once (all_skills is your own skill list)
skill_texts = [f"{s.name}: {s.description}" for s in all_skills]
skill_vectors = model.encode(skill_texts, normalize_embeddings=True)  # [N_skills, 384]

def search_skills(query: str, top_k: int = 5, min_score: float = 0.70):
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = np.dot(skill_vectors, q_vec)  # cosine similarity (vectors are unit-length)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [all_skills[i] for i in top_idx if scores[i] >= min_score]

Agent Scoring & Predefined Rubric

How do you know if the agent answered well? A rubric is a predefined set of rules: the agent "takes the test" against the rubric and does not grade itself.

What is a Rubric?

Rubric = Teacher's Answer Key

Just like a teacher grades papers against an answer key, a Rubric is a set of criteria written in advance by domain experts. The agent doesn't know the rubric exists — it just does the work, while the evaluator grades using the rubric.

❌ Without Rubric

Agent self-grades its own output → bias
Different score every run → inconsistent
Don't know exactly which part is weak to improve
No objective threshold to decide retry vs escalate

✅ With Rubric

Third party (evaluator) grades using fixed criteria
100% consistent — same output always gives same score
Knows exactly which dimension failed (relevance? accuracy?)
Automatic: PASS → deliver, RETRY → re-run, ESCALATE → human

Rubric Structure

Dimensions — Independent evaluation axes. Each dimension has a weight and its own scoring method. Example: Relevance 35%, Accuracy 30%, Completeness 20%, Cost 15%.

Scoring criteria — Specific descriptions of what 1.0 looks like, 0.7, and 0.4. No vague "good" or "bad" — concrete, unambiguous criteria.

Validators — Programmatic checks: cross-check SQL, range sanity, required fields. Runs code, doesn't ask AI — so results are 100% objective.

Thresholds + Gates — Decision thresholds: ≥ 0.75 → PASS, 0.60–0.74 → WARN, 0.45–0.59 → RETRY (max 2x), < 0.45 → ESCALATE human.
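
The weights and thresholds described above combine into a short scoring function. This is a minimal sketch that uses the example weights and gate boundaries from this section. weighted_score and gate are hypothetical names, and treating an exhausted retry budget as an escalation is an assumption.

```python
# Example weights from the Dimensions description above
WEIGHTS = {"relevance": 0.35, "accuracy": 0.30, "completeness": 0.20, "cost": 0.15}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one weighted total."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


def gate(score: float, retries_used: int = 0, max_retries: int = 2) -> str:
    """Map a total score to a decision using the thresholds above."""
    if score >= 0.75:
        return "PASS"
    if score >= 0.60:
        return "WARN"
    if score >= 0.45 and retries_used < max_retries:
        return "RETRY"
    return "ESCALATE"  # below 0.45, or retry budget exhausted


total = weighted_score({"relevance": 1.0, "accuracy": 0.7, "completeness": 0.4, "cost": 1.0})
print(f"{total:.2f} -> {gate(total)}")
```

Because the gate is plain code, the same total always produces the same decision, whichever method scored the individual dimensions.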

💡

Who writes the Rubric? Domain experts — not developers. Lawyers write rubrics for legal agents. Data analysts write rubrics for data pipelines. A Rubric is where domain knowledge is encoded into machine-readable rules. Once written, the whole team shares it, version-controlled like code.

Scoring Pipeline Architecture

Rubric YAML — Defining the Scale

# Rubric for data analysis task
task_type: data_analysis
version: "2.1"

dimensions:
  relevance:
    weight: 0.35
    description: "Does the result actually answer the question?"
    scoring:
      1.0: "Answers directly, correct metric asked for"
      0.7: "Answered but missing some dimensions"
      0.4: "Partially off-topic"

  accuracy:
    weight: 0.30
    validators:
      - type: cross_check_sql
      - type: range_sanity  # revenue cannot be negative

  completeness:
    weight: 0.20
    required_fields: [summary, kpis, anomalies, recommendation]

  cost_efficiency:
    weight: 0.15
    max_tokens_per_task: 50000
    penalty_over_budget: 0.5

thresholds:
  pass: 0.75
  warn: 0.60
  retry: 0.45
  # anything below retry (< 0.45) escalates to a human

retry_strategy:
  max_retries: 2
  on_fail_dimension: relevance  # re-prompt focused on failing dimension
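
Because the final score is a weighted sum, a rubric whose weights do not add up to 1.0 or whose thresholds are out of order will quietly skew verdicts. A minimal sanity check might look like the sketch below. It works on the already-parsed dictionary, and `validate_rubric` is a hypothetical helper, not something the SDK provides.

```python
import math

def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found in a parsed rubric (empty list means OK)."""
    problems = []
    weights = [d.get("weight", 0.0) for d in rubric.get("dimensions", {}).values()]
    # Compare with a tolerance: float sums such as 0.35 + 0.30 + 0.20 + 0.15 are not exact.
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {sum(weights):.2f}, expected 1.00")
    t = rubric.get("thresholds", {})
    if not (t.get("pass", 0) > t.get("warn", 0) > t.get("retry", 0)):
        problems.append("thresholds must satisfy pass > warn > retry")
    return problems

rubric = {
    "dimensions": {
        "relevance": {"weight": 0.35},
        "accuracy": {"weight": 0.30},
        "completeness": {"weight": 0.20},
        "cost_efficiency": {"weight": 0.15},
    },
    "thresholds": {"pass": 0.75, "warn": 0.60, "retry": 0.45},
}
print(validate_rubric(rubric))
```

Running a check like this when the rubric file is loaded, or in CI alongside the version-controlled YAML, catches editing mistakes before they affect scoring.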

Scoring Methods — Implementation Tabs

Each rubric dimension is scored using a different method. No single method is "best" — what matters is matching the dimension's nature.

Method: LLM Judge (Haiku)

Relevance can't be checked with pure code — it needs semantic understanding. We use Haiku (cheap, fast) as a "sub-judge" — prompt it to score against rubric criteria.

import anthropic

async def score_relevance(question: str, agent_output: str, criteria: dict) -> float:
    client = anthropic.AsyncAnthropic()  # async client, so the call can be awaited without blocking
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""
Original question: {question}

Agent output:
{agent_output[:2000]}

Score Relevance (0.0 - 1.0) on this scale:
1.0 = {criteria['scoring'][1.0]}
0.7 = {criteria['scoring'][0.7]}
0.4 = {criteria['scoring'][0.4]}
0.0 = Completely irrelevant

Return only a single decimal number, no explanation.
"""
        }]
    )
    try:
        score = float(response.content[0].text.strip())
        return max(0.0, min(1.0, score))
    except (ValueError, IndexError, AttributeError):
        return 0.5  # neutral fallback when the reply is not a bare number

ℹ️

Cost: Haiku ~$0.00025/call here. At 500 tasks/day, relevance scoring costs only ~$0.12/day.
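
Even when told to return only a number, a judge model sometimes wraps it in a few words ("Score: 0.7"). A more tolerant parser is sketched below. It assumes the first number in the reply is the score, and `parse_score` is an illustrative helper name.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"[-+]?\d*\.?\d+")

def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [0.0, 1.0].

    Returns None when no number is present, so the caller can pick a fallback.
    """
    match = _NUMBER.search(reply)
    if match is None:
        return None
    return max(0.0, min(1.0, float(match.group())))
```

Returning None rather than a default keeps "the judge said 0.5" distinguishable from "the reply could not be parsed", which is useful when logging judge reliability.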

Method: Programmatic Validators

Accuracy uses pure code, with no model call. Results are deterministic, the checks cost no tokens, and they typically run in milliseconds (the SQL cross-check adds a database round trip).

def validate_range_sanity(output: dict) -> float:
    violations = 0
    checks = 0
    for key in ["revenue", "gmv", "arr"]:
        if key in output.get("kpis", {}):
            checks += 1
            if output["kpis"][key] < 0:
                violations += 1
    if checks == 0: return 1.0
    return 1.0 - (violations / checks)

def validate_cross_check_sql(output: dict, source_db) -> float:
    try:
        sql_result = source_db.query(
            "SELECT SUM(revenue) FROM orders WHERE date >= %s",
            [output["period_start"]]
        )
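        expected = sql_result[0][0]
        reported = output["kpis"]["revenue"]
        if not expected:
            # SUM() returned NULL or 0, so a relative error is undefined
            return 1.0 if reported == 0 else 0.2
        diff_pct = abs(reported - expected) / abs(expected)
        if   diff_pct < 0.001: return 1.0
        elif diff_pct < 0.01:  return 0.85
        elif diff_pct < 0.05:  return 0.6
        else:                   return 0.2
    except Exception:
        return 0.5  # could not verify: query failed or a field is missing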
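
The combining step later in this section calls score_accuracy, which is not shown. One possible shape is sketched below, assuming it averages whichever validators the rubric lists. The registry, the equal-weight mean, and the compact `range_sanity` stand-in are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Validator = Callable[[dict], float]

def range_sanity(output: dict) -> float:
    """Share of monetary KPIs that are non-negative (1.0 if none are present)."""
    values = [v for k, v in output.get("kpis", {}).items() if k in ("revenue", "gmv", "arr")]
    if not values:
        return 1.0
    return sum(1 for v in values if v >= 0) / len(values)

# Hypothetical registry: maps the `type` string from the rubric YAML to a function.
VALIDATORS: Dict[str, Validator] = {"range_sanity": range_sanity}

def score_accuracy(output: dict, validator_specs: List[dict]) -> float:
    """Average the scores of all configured validators that are registered."""
    scores = [
        VALIDATORS[spec["type"]](output)
        for spec in validator_specs
        if spec["type"] in VALIDATORS
    ]
    # With no runnable validator there is no evidence either way, so return
    # the same neutral 0.5 used as the fallback in the SQL cross-check.
    return sum(scores) / len(scores) if scores else 0.5
```

A validator that needs extra context, such as the database handle for the SQL cross-check, could be registered as a closure or with functools.partial so that every entry keeps the same one-argument signature.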

Method: Field Presence Check

Completeness simply checks: are the required fields present in the output? No AI, no SQL — pure dictionary check.

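def score_completeness(output: dict, required_fields: list) -> float:
    if not required_fields:
        return 1.0  # nothing required, so nothing is missing
    found = 0
    for field in required_fields:
        parts = field.split(".")  # dotted paths address nested fields
        val = output
        try:
            for p in parts: val = val[p]
            has_value = val is not None and val != "" and val != []
            if has_value: found += 1
        except (KeyError, TypeError, IndexError):
            pass  # missing key, or the path runs through a non-dict value
    return found / len(required_fields)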

Method: Token Budget Check

Cost Efficiency measures whether the agent consumed tokens within the allowed budget. Token counts come from ResultMessage.usage in the SDK.

def score_cost_efficiency(result_message, max_tokens: int, penalty: float = 0.5) -> float:
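    # In the Python SDK, ResultMessage.usage is an optional dict, not an object with attributes
    usage = result_message.usage or {}
    total = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)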
    if total <= max_tokens:
        usage_pct = total / max_tokens
        return 1.0 - (usage_pct * 0.15)
    else:
        overrun_pct = (total - max_tokens) / max_tokens
        base_score = 1.0 * penalty
        return max(0.1, base_score - overrun_pct * 0.2)
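
To see how this piecewise rule behaves around the budget boundary, here is the same arithmetic taking a plain token count instead of a result message. `cost_score` is an illustrative name, and the 50,000-token budget matches the rubric above.

```python
def cost_score(total_tokens: int, max_tokens: int, penalty: float = 0.5) -> float:
    """Same piecewise rule as score_cost_efficiency, on a plain token count."""
    if total_tokens <= max_tokens:
        # Within budget: lose at most 0.15 as usage approaches the limit.
        return 1.0 - (total_tokens / max_tokens) * 0.15
    # Over budget: start from the penalty and subtract 0.2 per 100% of overrun.
    overrun = (total_tokens - max_tokens) / max_tokens
    return max(0.1, penalty - overrun * 0.2)

for tokens in (25_000, 50_000, 50_001, 60_000, 200_000):
    print(tokens, round(cost_score(tokens, 50_000), 3))
```

The score falls from 0.85 at exactly the budget to roughly 0.5 one token later, so the rule acts as a hard cliff rather than a smooth slope. Whether that is desirable depends on how strictly the budget should be enforced.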

Method: Weighted Sum + Gate

Combine all dimensions into a single score via weighted sum. Then compare against thresholds to produce a verdict.

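from dataclasses import dataclass

import yaml

# Minimal result container (assumed shape; adapt to your pipeline)
@dataclass
class RubricResult:
    scores: dict   # per-dimension scores
    final: float   # weighted sum
    verdict: str   # PASS / WARN / RETRY / ESCALATE
    failed: list   # dimensions scoring below 0.6

async def evaluate_with_rubric(question, agent_output, result_msg, rubric_path) -> RubricResult:
    with open(rubric_path) as f:  # close the file once the rubric is parsed
        rubric = yaml.safe_load(f)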
    dims = rubric["dimensions"]
    scores = {}
    scores["relevance"]      = await score_relevance(question, str(agent_output), dims["relevance"])
    scores["accuracy"]       = score_accuracy(agent_output, dims["accuracy"]["validators"])
    scores["completeness"]   = score_completeness(agent_output, dims["completeness"]["required_fields"])
    scores["cost_efficiency"] = score_cost_efficiency(result_msg, dims["cost_efficiency"]["max_tokens_per_task"])
    final = sum(scores[d] * dims[d]["weight"] for d in scores)
    t = rubric["thresholds"]
    verdict = ("PASS" if final >= t["pass"] else
               "WARN" if final >= t["warn"] else
               "RETRY" if final >= t["retry"] else "ESCALATE")
    failed = [d for d, s in scores.items() if s < 0.6]
    return RubricResult(scores, final, verdict, failed)
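
The failed list is what makes the rubric's retry strategy ("re-prompt focused on failing dimension") possible. One way to turn it into a follow-up prompt is sketched below. `build_retry_prompt` and the wording of the prompt are hypothetical and would need tuning for a real agent.

```python
def build_retry_prompt(question: str, failed: list[str], descriptions: dict[str, str]) -> str:
    """Compose a follow-up prompt that names the dimensions that scored low."""
    if not failed:
        return question  # nothing to focus on, so re-ask unchanged
    notes = "\n".join(
        f"- {dim}: {descriptions.get(dim, 'see rubric')}" for dim in failed
    )
    return (
        f"{question}\n\n"
        "The previous answer scored low on these rubric dimensions:\n"
        f"{notes}\n"
        "Revise the answer with particular attention to them."
    )
```

For example, `build_retry_prompt(question, result.failed, {"relevance": "Does the result actually answer the question?"})` yields the original question followed by a short note on relevance.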

Scoring Calculator: Worked Example

How the weighted sum and verdict work for one set of dimension scores:

Dimension	Score	Weight	Contribution
Relevance	0.88	0.35	0.308
Accuracy	0.95	0.30	0.285
Completeness	1.00	0.20	0.200
Cost Efficiency	0.85	0.15	0.128

0.308 + 0.285 + 0.200 + 0.128 = 0.921

✅ PASS 0.921

Thresholds (scale 0 to 1.0): RETRY from 0.45, WARN from 0.60, PASS from 0.75

[Figure: radar chart of score by dimension]

Console Demo — Full End-to-End Run

Watch the entire pipeline run in a terminal: from receiving the prompt, scanning documents, selecting skills, spawning subagents, to scoring the result.

Step 1 — Setup & Index Data

bash — setup

$ pip install claude-agent-sdk qdrant-client sentence-transformers voyageai pyyaml

✓ Installed 5 packages

$ docker run -d -p 6333:6333 qdrant/qdrant

✓ Qdrant vector DB running on :6333

$ python scripts/index.py --docs ./knowledge_base/ --skills ./skills/

📄 Scanning 2,847 documents...

✂️ Chunked → 18,432 chunks (avg 480 tokens)

🔢 Embedding with voyage-4... [████████████] 100%

💾 Stored 18,432 vectors in Qdrant

🛠️ Indexed 47 skills with MiniLM embeddings

✅ Index ready in 4m 22s

Step 2 — Run Pipeline with Real Prompt

python run_pipeline.py

$ python run_pipeline.py --prompt "Analyze Q4/2025 revenue, compare Q3, find anomalies and suggest actions"

━━━ LAYER 1: ROUTING ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Orchestrator] Intent: data_analysis + comparison + anomaly_detection

[Token Budget] Allocated: skills=8k | docs=12k | exec=40k | reserve=10k

[Cache Check] System prompt cached ✓ (saves ~3k tokens/call)

━━━ LAYER 2: RETRIEVAL ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Tool Search] Cosine search over 47 skills...

1. data-analysis 0.94 ✓ selected

2. sql-query 0.88 ✓ selected

3. visualization 0.81 ✓ selected

4. report-writing 0.76 ✓ selected

5. forecasting 0.62 — skipped (below 0.70)

→ 4,200/8,000 skill tokens used (52%)

[Vector Search] Searching 18,432 chunks in 12ms...

revenue_Q4_2025.csv sim=0.92 ✓

revenue_Q3_2025.csv sim=0.89 ✓

regional_breakdown.xlsx sim=0.84 ✓

budget_plan_2025.pdf sim=0.68 — skipped

━━━ LAYER 3: EXECUTION (3 subagents parallel) ━━━━━━━━━━

[Subagent A] START → data-analysis | claude-sonnet-4-6

[Subagent B] START → sql-query | claude-haiku-4-5

[Subagent C] START → anomaly-detect | claude-sonnet-4-6

[Subagent B] Read(revenue_Q4.csv) → 12,847 rows

[Subagent B] Bash("python3 kpi_calc.py") → Revenue Q4: $142.3M | Δ+11.1% vs Q3

[Subagent B] DONE 18s | 8,420 tokens

[Subagent C] Bash("python3 anomaly.py --zscore 2.5")

⚠️ APAC Nov-15: revenue spike +340%

Memory cross-ref: APAC adjustment rule → not an error

[Subagent C] DONE 24s | 11,200 tokens

[Subagent A] DONE 31s | 16,800 tokens

━━━ LAYER 4: SCORING ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Rubric Eval] rubric_data_analysis.yaml v2.1

relevance 0.88 × 0.35 = 0.308

accuracy 0.95 × 0.30 = 0.285 ✓ SQL validated

completeness 1.00 × 0.20 = 0.200 ✓ all fields

cost_efficiency 0.85 × 0.15 = 0.128 (36k/50k budget)

FINAL = 0.921 ✅ PASS (threshold: 0.75)

━━━ SUMMARY ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Time: 73s (3 subagents parallel)

Tokens: 36,420 / budget 70k (52% used)

Skills: 4/47 loaded (saved ~43k tokens)

Score: 0.921 PASS

Est. cost: ~$0.18 (Haiku + Sonnet mix)

Cost Optimization — Strategy Comparison

Strategy	Tokens used	Cost (est.)	Accuracy
Load-all skills + docs	~2,000,000+	~$6.00	High but wasteful
Tool Search + Vector Retrieval	~70,000	~$0.18	Nearly equivalent
All-Opus every task	70,000	~$2.10	Highest
Model routing (Haiku+Sonnet)	70,000	~$0.18	Nearly equivalent
Routing + Prompt caching	70,000 (40% cached)	~$0.11	Same as routing

💡

Rule of thumb: Haiku ($) for filter/classify, Sonnet ($$) for analyze/write, Opus ($$$) only when deep reasoning is truly needed. Add prompt caching for another 30-40% savings.
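
The rule of thumb can be sketched as a lookup plus a cost estimate. The task categories are hypothetical. The per-million-token prices are the ones quoted in this document's fact-check section and are treated here as assumptions; they should be checked against the current pricing page.

```python
# Assumed per-million-token prices in USD as (input, output).
PRICES = {
    "haiku": (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}

# Hypothetical task categories mapped to the cheapest tier expected to handle them.
ROUTES = {
    "filter": "haiku",
    "classify": "haiku",
    "analyze": "sonnet",
    "write": "sonnet",
    "deep_reasoning": "opus",
}

def route(task_kind: str) -> str:
    """Pick a model tier for a task kind, defaulting to the middle tier."""
    return ROUTES.get(task_kind, "sonnet")

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call at the assumed prices."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

For example, `estimate_cost(route("analyze"), 60_000, 10_000)` prices a 70k-token analysis task on the Sonnet tier. Prompt caching discounts are not modeled in this sketch.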

Fact-Check Report — Claim-by-Claim Validation

All claims in this document were independently searched and verified from official sources (Anthropic docs, PyPI, GitHub, pricing pages). Checked: 14/05/2026.

✅ PASS: Accurate
⚠️ PARTIAL: Correct but needs more context
🔧 UPDATE: Needs updating
❌ INCORRECT: Wrong / Outdated

1. SDK Identity & Package Names

1.1

"Claude Agent SDK (formerly Claude Code SDK)"

✅

Confirmed: Anthropic officially renamed the SDK in Sept 2025. The claude-code-sdk package on PyPI is DEPRECATED and redirects to claude-agent-sdk.

1.2

Install: pip install claude-agent-sdk & npm install @anthropic-ai/claude-agent-sdk

✅

Confirmed on PyPI and npmjs. Python requires 3.10+; TypeScript/Node.js requires 18+.

1.3

Python examples use asyncio.run()

⚠️

Functionally correct, but the Python SDK uses anyio as its async backend and the official README uses anyio.run(main). Both work; anyio is the canonical approach.

2. Core API & Parameters

2.1

ClaudeCodeOptions → ClaudeAgentOptions

✅

Migration guide confirms: "ClaudeCodeOptions renamed to ClaudeAgentOptions". Breaking change in v0.1.0.

2.2

Parameters: system_prompt, max_turns, allowed_tools, disallowed_tools

✅

All confirmed from official GitHub Python SDK and docs.

2.3

Message types: AssistantMessage, ToolUseMessage, ResultMessage, SystemMessage

⚠️

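Partly correct. AssistantMessage, ResultMessage, and SystemMessage match the Python SDK (which also has UserMessage), but tool calls arrive as ToolUseBlock content blocks inside an AssistantMessage, not as a separate ToolUseMessage type. Note: the TypeScript SDK also has SDKCompactBoundaryMessage for auto-compaction, an edge case the doc doesn't mention.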

2.4

query() vs ClaudeSDKClient — query for single, Client for multi-turn + hooks + custom tools

✅

Confirmed from official docs: query() = new session each time, no hooks. ClaudeSDKClient = persistent session, hooks, custom tools.

3. Built-in Tools

3.1

Built-in tools: Read, Edit/Write, Bash, Glob, Grep, WebFetch, TodoWrite

✅

Confirmed from GitHub README. All tools verified.

3.2

permission_mode="acceptEdits" for automation

✅

Confirmed from quickstart. bypassPermissions also exists for full automation.

4. Custom Tools & MCP

4.1

Decorator @tool + create_sdk_mcp_server() for in-process MCP

✅

Confirmed from GitHub README and official docs. Custom tools run in-process (no separate subprocess needed).

4.2

MCP config: mcp_servers=[{"name": "github", "url": "..."}]

⚠️

Actual API format is a dict: mcp_servers={"server_name": server_config}. List format is from an older API version.

4.3

MCP tool name format: mcp__server_name__tool_name

✅

Confirmed from official custom tools docs.
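
Items 4.2 and 4.3 together can be illustrated with a small configuration fragment. The server names, URL, command, and the `list_issues` tool are placeholders. The `type`, `url`, `command`, and `args` keys follow the SDK's HTTP and stdio server config shapes as understood at the time of writing, and should be verified against the installed SDK version.

```python
# Server configs keyed by server name (a dict, not a list).
# The URL and command below are placeholders, not working endpoints.
mcp_servers = {
    "github": {"type": "http", "url": "https://example.com/mcp"},
    "local_tools": {"command": "python", "args": ["tools_server.py"]},
}

def mcp_tool_name(server: str, tool: str) -> str:
    """Build the fully qualified tool name in the mcp__server__tool format."""
    return f"mcp__{server}__{tool}"

# Names in this format are what would go into allowed_tools.
allowed_tools = [mcp_tool_name("github", "list_issues")]
```

The dict key doubles as the server_name segment of each tool name, so renaming a server also changes every entry in allowed_tools that refers to it.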

5. Pricing & Model Names

5.1

Pricing: Haiku $0.80/$4.00, Sonnet $3/$15, Opus $15/$75

🔧

Needs update. May 2026 pricing per million tokens (input/output): Haiku 4.5 = $1.00/$5.00, Sonnet 4.6 = $3.00/$15.00 ✅, Opus 4.6 = $5.00/$25.00 (not $15/$75, which was the older Opus 4.1 rate).

5.2

Batch API: 50% discount

✅

Confirmed: "The Batch API allows asynchronous processing with a 50% discount on both input and output tokens."

5.3

Agent SDK separate credit pool from 15/6/2026

✅

Confirmed from official announcement. From 15/6/2026, claude -p and the Agent SDK use a separate programmatic credit pool.

6. Architecture Claims

6.1

Tool Search reduces 90%+ context vs load-all tools

✅

Confirmed from Anthropic Cookbook: "cutting context usage by 90%+ while enabling applications that scale to thousands of tools."

6.2

Anthropic recommends agentic search first, add semantic search only when needed

✅

Confirmed from Anthropic engineering blog: "we suggest starting with agentic search."

6.3

Subagents use isolated context windows, only send results to orchestrator

✅

Confirmed: "Subagents use their own isolated context windows, and only send relevant information back to the orchestrator."

6.4

Voyage AI is Anthropic's recommended embedding provider

✅

Confirmed from Anthropic embeddings docs: "Voyage AI" with model voyage-4 as current generation.

Summary

✅ PASS

Accurate, confirmed from official sources

⚠️ PARTIAL

Correct in essence, needs more context

🔧 UPDATE

Old model pricing — update to 2025/2026 rates

❌ INCORRECT

No claims found completely wrong

💡

Action items: (1) Update pricing calculator: Haiku = $1/$5, Opus = $5/$25 (not $15/$75). (2) MCP servers config format: use dict {"name": config}, not a list. (3) strict_mcp_config: verify with current SDK version before using. (4) GitHub MCP URL in code examples is illustrative — not a real working URL.