Claude Agent SDK
A programming toolkit for building autonomous AI agents — no need to write your own tool execution loop.
What's the core idea?
Imagine hiring an AI contractor. Instead of standing over their shoulder issuing instructions step by step, you state a goal once, and the AI figures out the steps, does the work, and checks the result on its own. The SDK handles the entire loop for you.
Quick install
# Python
pip install claude-agent-sdk
# TypeScript / Node.js
npm install @anthropic-ai/claude-agent-sdk
# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    async for message in query(
        prompt="Find and fix the bug in auth.py",
        options=ClaudeAgentOptions(allowed_tools=["Read", "Edit", "Bash"])
    ):
        print(message)  # Claude reads → finds bug → edits → done

asyncio.run(main())

SDK vs API Client
Anthropic offers two ways to use Claude. Understanding the difference helps you pick the right tool for your project.
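# Raw API client: you must implement this loop yourself
while response.stop_reason == "tool_use":
    result = your_tool_executor(response)  # your own code runs the requested tool
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": result})  # result: a list of tool_result blocks
    response = client.messages.create(messages=messages, ...)

# Agent SDK: the same loop runs inside query()
async for msg in query(
    prompt="Fix the bug"
):
    print(msg)
# Claude: read → edit → test → done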
| Criteria | Raw API Client | Agent SDK |
|---|---|---|
| Setup complexity | Low — plain API call | Medium — but does far more |
| Tool execution | You implement yourself | Claude handles it all |
| Best for | Chat apps, simple Q&A | Automation, CI/CD, data pipelines |
| When to use | Need per-step control | Need agent to work autonomously |
Pro tip: Many teams use both — the raw API client for interactive chat UIs, and the Agent SDK for background automation pipelines running unattended.
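To make "tool execution loop" concrete, here is a minimal sketch of the control flow the Agent SDK takes off your hands. It does not call Claude: `fake_model`, `TOOLS`, and the message dictionaries are hypothetical stand-ins invented for this illustration, so treat it as a picture of the loop rather than of either API.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "Read": lambda path: f"<contents of {path}>",
    "Edit": lambda path: f"patched {path}",
}

def fake_model(history: list[dict]) -> dict:
    """Stand-in for a model call: ask for Read, then Edit, then stop."""
    tools_used = [m["tool"] for m in history if m["role"] == "tool"]
    if "Read" not in tools_used:
        return {"stop_reason": "tool_use", "tool": "Read", "input": "auth.py"}
    if "Edit" not in tools_used:
        return {"stop_reason": "tool_use", "tool": "Edit", "input": "auth.py"}
    return {"stop_reason": "end_turn", "text": "Fixed the bug in auth.py"}

def run_agent(prompt: str, max_turns: int = 10) -> tuple[str, int]:
    """Loop until the model stops asking for tools or max_turns is reached."""
    history: list[dict] = [{"role": "user", "text": prompt}]
    for turn in range(1, max_turns + 1):
        reply = fake_model(history)
        if reply["stop_reason"] != "tool_use":
            return reply["text"], turn
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[reply["tool"]](reply["input"])
        history.append({"role": "tool", "tool": reply["tool"], "result": result})
    return "stopped: max_turns reached", max_turns

answer, turns = run_agent("Find and fix the bug in auth.py")
print(answer, turns)
```

With the raw API client, everything inside `run_agent` is your code to write and maintain; with the Agent SDK it sits behind `query()`, and `max_turns` plays the same capping role.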
query() — Communicating with the Agent
query() is the central function of the SDK. Send it a task in natural language and receive a stream of messages as the agent works.
Basic syntax
from claude_agent_sdk import query, ClaudeAgentOptions, AssistantMessage, TextBlock

async for message in query(
    prompt="Analyze sales.csv and generate a summary report",
    options=ClaudeAgentOptions(
        system_prompt="You are an expert data analyst.",
        max_turns=10,  # max 10 action turns
        allowed_tools=["Read", "Bash"],
    )
):
    if isinstance(message, AssistantMessage):
        for block in message.content:
            if isinstance(block, TextBlock):
                print(block.text)

Key parameters
| Parameter | Meaning | Example |
|---|---|---|
| prompt | Task to send to the agent | "Fix bug in auth.py" |
| system_prompt | Define the agent's role | "You are a senior Python backend engineer" |
| max_turns | Limit on action turns | 10, 20, 50 |
| allowed_tools | Tools the agent may use | ["Read", "Edit", "Bash"] |
| disallowed_tools | Block specific tools | ["Bash"] (prevents shell commands) |
Message types returned
| Type | Meaning |
|---|---|
| AssistantMessage | Claude's reply or reasoning; its content is a list of blocks such as TextBlock and ToolUseBlock |
| ToolUseBlock (inside AssistantMessage.content) | Claude is calling a tool (reading a file, running a command…) |
| ResultMessage | Final result after the agent finishes |
| SystemMessage | System notification (context compacted, error…) |
Built-in Tools
The SDK ships with tools Claude can use immediately — no custom code needed. These are the agent's "hands".
Safety note: By default Claude will ask for confirmation before editing files or running commands. You can customize this via permission_mode to fully automate once you trust the environment.
Custom Tools & MCP
Beyond built-in tools, you can give the agent your own custom tools or connect it to external services via MCP.
Custom Tool in Python
Define an async Python function, attach the @tool decorator, and register it with the agent. Claude decides when to call it.
from claude_agent_sdk import tool, create_sdk_mcp_server, ClaudeAgentOptions, ClaudeSDKClient

# Define your own tool (your_db is a placeholder for your database client)
@tool("query_database", "Query the internal database", {"sql": str})
async def run_sql(args):
    result = your_db.execute(args["sql"])
    return {"content": [{"type": "text", "text": str(result)}]}

# Wrap as an in-process MCP server (no separate process needed)
server = create_sdk_mcp_server(
    name="data-tools", version="1.0", tools=[run_sql]
)

# Pass it to the agent; MCP tools are addressed as mcp__<server>__<tool>
options = ClaudeAgentOptions(
    mcp_servers={"data-tools": server},
    allowed_tools=["mcp__data-tools__query_database"],
)
async with ClaudeSDKClient(options=options) as client:
    await client.query("Analyze Q4 revenue from the database")
    async for message in client.receive_response():
        print(message)

Connecting an External MCP Server
MCP (Model Context Protocol) is Anthropic's open tool-connection standard. You can connect Slack, GitHub, Notion, external databases with just a few lines of config.
options = ClaudeAgentOptions(
    # mcp_servers is a dict keyed by server name; the URLs here are illustrative
    mcp_servers={
        "github": {"type": "sse", "url": "https://mcp.github.com/sse"},
        "postgres": {"type": "http", "url": "http://localhost:5432/mcp"},
    },
    # only use declared servers, ignore global config (verify against your SDK version)
    strict_mcp_config=True
)

Sessions & Context Management
Agents can work for hours continuously. Sessions help you manage memory, resume work after interruption, and avoid losing context when conversations grow long.
The problem: Context Window has limits
Claude can only "remember" a certain amount of text in a single working session (the context window). For large projects requiring many hours, the SDK automatically compacts context to retain important information without exceeding the limit.
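The idea behind compaction can be sketched in a few lines. This is not the SDK's algorithm, which the source does not describe: `count_tokens` (one token per word), `compact`, and the placeholder summary string are all assumptions made for illustration. A real system would ask the model to write the summary.

```python
def count_tokens(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())

def compact(messages: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """When over budget, collapse all but the most recent messages into one summary line."""
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Placeholder summary; a real implementation would summarize `older` with the model.
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent

history = [
    "read auth.py and found the token check",
    "the expiry comparison uses the wrong timezone",
    "edited auth.py to compare in UTC",
    "ran pytest and all tests pass",
]
print(compact(history, budget=15))
```

The design choice worth noticing is that recent messages are kept verbatim while older ones are condensed, which is why long sessions keep their latest working state but lose early detail.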
Resume a session after stopping
from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage

# Run 1: start the task and save the session ID reported in the final ResultMessage
session_id = None
async for message in query(prompt="Start refactoring the authentication module"):
    if isinstance(message, ResultMessage):
        session_id = message.session_id  # save this

# Run 2: continue from exactly where it left off
async for message in query(
    prompt="Continue: write unit tests for what you just refactored",
    options=ClaudeAgentOptions(resume=session_id),
):
    print(message)

Eager session flushing for live UIs
Use session_store_flush="eager" if you need to display output in real-time to a terminal or dashboard — data is pushed immediately instead of waiting until the end of a turn.
Hooks — Controlling Agent Behavior
Hooks let you intercept between agent steps to inspect, block, or modify actions — like a security layer you control.
Pre-tool Hook (before tool use)
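from claude_agent_sdk import ClaudeAgentOptions, HookMatcher

async def safety_check(input_data, tool_use_id, context):
    def decide(decision, reason):
        return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
                "permissionDecision": decision, "permissionDecisionReason": reason}}
    # Block the agent from deleting files (tool_input is a dict; Bash puts the shell text under "command")
    if input_data["tool_name"] == "Bash" and "rm -rf" in input_data["tool_input"].get("command", ""):
        return decide("deny", "File deletion not permitted")
    # Auto-approve safe commands
    if input_data["tool_name"] in ["Read", "Glob"]:
        return decide("allow", "Read-only tool")
    return {}  # no decision: the normal permission flow applies

# Hooks are registered per event on the options object (verify against your SDK version)
options = ClaudeAgentOptions(hooks={"PreToolUse": [HookMatcher(hooks=[safety_check])]})

Post-tool Hook (after tool use)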
Use to log actions, sanitize output before Claude sees it, or write an audit trail for compliance.
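A post-tool step of this kind might look like the sketch below. It is plain Python rather than the SDK's hook signature: `post_tool`, `SECRET_PATTERN`, and the audit record fields are hypothetical names, and the secret format (anything starting with `sk-`) is an assumption chosen for the example.

```python
import json
import re
from datetime import datetime, timezone

# Assumed secret shape for illustration: "sk-" followed by letters, digits, "-" or "_".
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]+")

def post_tool(tool_name: str, output: str, audit_log: list[str]) -> str:
    """Redact secret-looking strings and append one JSON audit record."""
    cleaned = SECRET_PATTERN.sub("[REDACTED]", output)
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
        "redacted": cleaned != output,
        "output_chars": len(output),
    }
    audit_log.append(json.dumps(record))
    return cleaned

log: list[str] = []
print(post_tool("Bash", "export KEY=sk-abc123 done", log))
```

Writing one JSON object per call keeps the audit trail easy to ship to a log pipeline, and returning the cleaned string is what lets the same function double as the sanitizer.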
Streaming Output
The agent returns results as a stream — you receive each message the moment Claude finishes it, without waiting for the full task to complete.
Why streaming matters
For complex tasks (refactoring an entire codebase, analyzing a large dataset), the agent may run for several minutes. Streaming lets you display actual progress in a terminal or dashboard rather than staring at a blank screen.
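from claude_agent_sdk import query, AssistantMessage, ResultMessage, TextBlock, ToolUseBlock

async for message in query(prompt="Run the full test suite and fix failures"):
    match message:
        case AssistantMessage():
            # Text and tool calls both arrive as blocks inside the assistant message
            for block in message.content:
                match block:
                    case TextBlock():
                        print(f"🤖 Claude: {block.text}")
                    case ToolUseBlock():
                        print(f"🔧 Using tool: {block.name}")
        case ResultMessage(is_error=True):
            print(f"❌ Error: {message.result}")
        case ResultMessage():
            print(f"✅ Done after {message.num_turns} turns")

Structured Output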
If you need JSON output (to process further in code), prompt Claude to return JSON and parse it:
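import json

# get_structured is a placeholder for your own helper that runs the agent and returns its final text
result = await get_structured(
    prompt="Analyze the file. Return JSON: {errors: [], warnings: [], summary: ''}"
)
data = json.loads(result)

Use Case: AI Coding Agent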
Apply the Claude Agent SDK to automate the full software development lifecycle — from code review to deployment.
Scenario: CI/CD detects a test failure on GitHub. Instead of an engineer debugging manually, the agent reads the stack trace, finds the root cause, fixes the code, re-runs tests, and opens a PR automatically.
Workflow
1. A GitHub Action calls the SDK when tests fail, passing in the error log and commit ID.
2. Uses Read + Grep to understand the context and find files related to the error.
3. Uses Edit to apply a surgical patch. No deletions, only the failing line is changed.
4. Uses Bash to run the test suite. If still failing, loops back to step 2 (up to max_turns).
5. Uses Bash + GitHub CLI to commit and open a Pull Request with a detailed description of the bug and fix.
async def auto_fix_bug(error_log: str, repo_path: str):
    async for msg in query(
        prompt=f"""
        Test suite failed with the following error:
        {error_log}
        Tasks:
        1. Read the stack trace and find the root cause
        2. Fix the code (only the broken part)
        3. Re-run: pytest tests/
        4. If passing: git add -A && git commit -m 'fix: ...'
        5. gh pr create --title 'Auto fix' --body 'Fixed by Claude Agent'
        """,
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Edit", "Bash", "Grep"],
            max_turns=20,
            cwd=repo_path,  # run the agent inside the repository
            system_prompt="You are a senior Python engineer. Fix bugs precisely, no unnecessary refactoring."
        )
    ):
        print(msg)

Use hooks for safety: Add a PreToolUse hook to block dangerous commands like git push --force or rm -rf before the agent accidentally runs them.
SDK features used
| Feature | Role in this project |
|---|---|
| query() + allowed_tools | Restrict agent to read/edit code only, nothing out of scope |
| max_turns | Prevent infinite loops when a bug is complex |
| Hooks | Audit log every file change for later review |
| Sessions | Continue a large refactor across multiple runs |
| ResultMessage | Know definitively whether the agent succeeded or timed out |
Use Case: Data Analytics Pipeline
Use the Agent SDK to build an automated large-scale data pipeline — from cleaning to visualization — without a data engineer running each step manually.
Scenario: Every morning, the system receives a fresh CSV from the data warehouse (~500k rows). The agent automatically checks data quality, cleans it, calculates KPIs, generates a PDF report with charts, and posts to Slack.
Workflow
Uses Read + Bash to run pandas.describe(), understand structure, detect nulls and outliers.
Agent writes a Python script to handle missing values, normalize formats, and deduplicate. Runs and verifies the result.
Runs aggregation queries: revenue, CAC, churn rate, top products by region. Outputs JSON.
Writes and executes matplotlib/plotly scripts to generate charts. Saves PNG/HTML.
Uses a Custom Tool to call the Slack API with the summary and chart attachments for the team.
async def morning_pipeline(csv_path: str):
    # slack_server: an MCP server exposing a send_slack tool, defined elsewhere
    async for msg in query(
        prompt=f"""
        New data file: {csv_path}
        Pipeline:
        1. Read and summarize (rows, columns, null %)
        2. Clean: handle nulls per domain rules, normalize date format
        3. Calculate KPIs: total revenue, by region, top 10 products
        4. Generate bar chart revenue by region, save to /reports/today.png
        5. Call send_slack #data-team with summary and chart path
        """,
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Write", "Bash"],
            max_turns=30,
            mcp_servers=[slack_server],
        )
    ):
        print(msg)

Use Case: Legal Document Review
Automate contract review — find risk clauses, flag unfavorable terms, compare against standard templates. Reduces junior lawyer time by 80%.
Scenario: Company receives 50–200 contracts per month from partners. Previously each took 2–4 hours for a lawyer to review. Agent reads the PDF, cross-references the internal legal playbook, tags each clause by risk level, and generates a memo for senior lawyer review.
Workflow
Uses Read to read the PDF. Extracts sections: Parties, Term, Payment, Liability, IP Ownership, Termination, Governing Law.
Loads legal playbook via Memory/RAG. Agent compares each clause with the standard template, tagging: ACCEPTABLE / NEGOTIABLE / RED_FLAG.
Calculates maximum exposure: penalty clauses, liability cap, indemnification scope. Outputs absolute dollar amounts.
Writes a concise memo for the senior lawyer: executive summary, top 3 risks, recommended redlines, overall risk score.
Calls a Custom Tool to push results to the case management system (Clio / Notion / Jira Legal).
Real-world ROI: At 100 contracts/month × 3h × $150/h junior lawyer = $45,000/month saved. Agent cost estimate: ~$200–400/month.
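The ROI arithmetic above is easy to reproduce and vary. In this sketch the function name `monthly_savings` and the `reduction` parameter are illustrative additions; the input numbers are the ones quoted in this section.

```python
def monthly_savings(contracts: int, hours_each: float, hourly_rate: float,
                    reduction: float = 1.0) -> float:
    """Lawyer cost avoided per month; `reduction` is the share of review time removed."""
    return contracts * hours_each * hourly_rate * reduction

# Figures from the text: 100 contracts/month, 3 h each, $150/h.
gross = monthly_savings(contracts=100, hours_each=3, hourly_rate=150)
at_80_percent = monthly_savings(100, 3, 150, reduction=0.8)

# Agent cost estimate from the text: $200-400/month.
net_low, net_high = gross - 400, gross - 200

print(f"gross ${gross:,.0f}, at 80% ${at_80_percent:,.0f}, net ${net_low:,.0f} to ${net_high:,.0f}")
```

Note that the $45,000 figure corresponds to `reduction=1.0`, meaning all junior review time is removed. Applying the 80% reduction quoted in the section introduction gives $36,000 before agent cost.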
Use Case: Customer Support Triage
Automatically classify, prioritize, and resolve support tickets — the agent handles 70% of tier-1 tickets without human intervention, escalating complex cases with full context.
Scenario: B2B SaaS receives 500–2000 tickets/day via email/Intercom. 70% are repetitive questions (billing, account, basic how-to). Agent reads each ticket, looks up the knowledge base + account history, replies or escalates with full context for the human agent.
3-tier processing architecture
Receive ticket → classify intent + urgency + sentiment (1–5) + tier: TIER1_AUTO / TIER2_HUMAN / TIER3_URGENT. Cost: ~$0.001/ticket.
For TIER1: pull account data via Custom Tool, find related tickets (vector search), load relevant KB articles. Compose a full reply.
TIER1: send reply directly + close ticket. TIER2/3: escalate with summary — what the agent found and recommended action — human agent understands in 30 seconds.
Real numbers: 1000 tickets/day, TIER1 ~70% = 700 tickets. Cost: Haiku classify 1000 + Sonnet resolve 700 ≈ $8–12/day vs. 5 human agents × $30/day = $150.
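The last stage of the three-tier flow, deciding what happens to a ticket once it has been classified, can be sketched as a small dispatch function. The `Ticket` dataclass, its field names, and the shape of the returned dictionaries are hypothetical; only the tier labels and the reply-or-escalate behavior come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    id: int
    tier: str              # "TIER1_AUTO", "TIER2_HUMAN" or "TIER3_URGENT"
    summary: str           # what the agent found and recommends
    draft_reply: str = ""  # only filled in for tier-1 tickets

def dispatch(ticket: Ticket) -> dict:
    """Map an already-classified ticket to the action described for its tier."""
    if ticket.tier == "TIER1_AUTO":
        return {"action": "reply_and_close", "ticket": ticket.id, "body": ticket.draft_reply}
    if ticket.tier in ("TIER2_HUMAN", "TIER3_URGENT"):
        return {
            "action": "escalate",
            "ticket": ticket.id,
            "urgent": ticket.tier == "TIER3_URGENT",
            "context": ticket.summary,
        }
    # An unexpected label should fail loudly rather than be silently dropped.
    raise ValueError(f"unknown tier: {ticket.tier}")

print(dispatch(Ticket(1, "TIER1_AUTO", "billing question", "Your invoice is attached.")))
print(dispatch(Ticket(2, "TIER3_URGENT", "production outage reported")))
```

Keeping dispatch as deterministic code, separate from the model-driven classification, means the only place a model can be wrong is the tier label, which is also the easiest thing to audit.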
Use Case: Security Threat Detection
Agent continuously reads logs, correlates events, detects attack patterns, generates incident reports, and triggers automated response — an intelligent SIEM without rule-based limitations.
Scenario: A 3-person security team manages 200-server infrastructure. Not enough people to read logs 24/7. Agent runs every 15 minutes, reads aggregated logs, finds anomalies, correlates with threat intelligence, automatically blocks suspicious IPs, and generates an incident report for the on-call engineer.
Workflow
Reads log batches from S3/Elasticsearch. Runs a bash normalization pipeline to standardize format from multiple sources (nginx, auth, firewall, app).
Uses Bash to run statistical analysis: IP request rate, failed auth spikes, unusual port activity, impossible geo-velocity travel. Flags outliers with Z-score > 3.
WebFetch checks flagged IPs against VirusTotal / AbuseIPDB. Cross-references with internal blocklist via Custom Tool.
If confidence is high: calls Custom Tool to block IP on the firewall automatically. If uncertain: escalates to human with evidence packet.
Generates report in MITRE ATT&CK framework: Tactic / Technique / Evidence / Impact / Remediation. Sends to PagerDuty + Slack #security.
Always audit with Hooks: Every block_ip call must be logged with timestamp + reasoning. Pre-tool hook should require a minimum confidence score before executing. Human oversight is critical here.
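The "Z-score > 3" rule from the anomaly step can be sketched with the standard library alone. `flag_outliers` and the sample traffic are hypothetical, and the sketch uses the population standard deviation; a production pipeline would likely score per time window and per metric.

```python
from statistics import mean, pstdev

def flag_outliers(requests_per_ip: dict[str, int], threshold: float = 3.0) -> list[str]:
    """Return IPs whose request count is more than `threshold` std devs above the mean."""
    counts = list(requests_per_ip.values())
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []  # all IPs behave identically, nothing stands out
    return [ip for ip, n in requests_per_ip.items() if (n - mu) / sigma > threshold]

# Hypothetical batch: 20 ordinary clients and one very noisy address.
traffic = {f"10.0.0.{i}": 100 for i in range(1, 21)}
traffic["203.0.113.9"] = 5000

print(flag_outliers(traffic))
```

One caveat with this rule: using the population standard deviation, the largest possible z-score in a batch of n values is sqrt(n - 1), so a batch of ten or fewer IPs can never produce a score above 3. Small batches need a different test or a lower threshold.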
Use Case: Content Production Factory
Multi-channel marketing content pipeline — from a single brief, the agent produces a blog post, 10 social posts, an email newsletter, and SEO metadata, all consistent in voice and brand.
Scenario: A SaaS startup needs 3–4 content pieces/week across blog, LinkedIn, Twitter/X, and email newsletter. One content writer can't keep up. Agent receives a brief from the product team, researches, creates multi-channel drafts, optimizes for SEO, saves to the CMS — the writer only reviews and approves.
Workflow
WebFetch reads the top 5 articles on the topic. Grep through the internal blog archive to avoid duplicates. Builds an outline based on gap analysis.
Creates a 1500–2500 word post with standard SEO structure: H1/H2/H3, internal link suggestions, meta description, target keyword density.
Extracts the 10 best quotes/insights from the blog. Reformats for LinkedIn (professional), Twitter/X (punchy, hooks), and Threads (conversational).
Summarizes blog into a 300-word newsletter with CTA, 3 A/B subject line variants, and preview text.
Custom Tool pushes draft to Contentful/WordPress with correct tags, categories, and scheduled publish date. Notifies the content writer via Slack.
import asyncio

# create_blog, create_social_posts, create_newsletter and optimize_seo are
# placeholders for your own agent-backed coroutines
async def content_factory(brief: dict):
    # Orchestrator creates the blog first
    blog_content = await create_blog(brief)
    # 3 subagents run in parallel from the blog content
    social, email, seo = await asyncio.gather(
        create_social_posts(blog_content, platforms=["linkedin", "twitter", "threads"]),
        create_newsletter(blog_content),
        optimize_seo(blog_content, target_keyword=brief["keyword"]),
    )
    return {"blog": blog_content, "social": social, "email": email, "seo": seo}

Use Case: Employee Onboarding Agent
New employees get a personal AI agent — it reads the entire wiki, policy docs, and codebase, and answers any onboarding question 24/7 with full context about the company and their specific role.
Scenario: A 200-person tech company hires 5–10 people per month. Each new hire takes 2–4 weeks to become productive. HR spends 40% of their time answering the same questions repeatedly. The agent is personalized by role (engineer/designer/sales), has the company's full knowledge base, and learns from each interaction.
Dual ROI: Saves HR time (answering repetitive questions) + shortens time-to-productive from 4 weeks to 2 weeks. At $5k/employee onboarding cost, 10 hires/month = $25k–30k/month saved.
Architecture Diagrams — How It All Connects
Understand how the Agent SDK, Claude model, tools, and your client interact with each other.
Agent Loop — Overall Diagram
Server Stdout → Client Flow
CLI / Stdout Demo — Real World
See how the agent operates in the terminal — from the invocation to each streamed message.
Method 1: Use the CLI directly (no code)
Method 2: Python SDK with stream output
Method 3: Subprocess — call the agent from any language
Because the agent communicates via JSON lines on stdout, you can call it from Go, Ruby, Java... by spawning a subprocess and reading stdout line by line.
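The consuming side of that pattern is sketched here in Python; the same spawn-and-read-lines approach carries over to other languages. `stream_json_lines` is a hypothetical helper, and the child process is a stand-in Python one-liner that prints two JSON lines, so the example runs without the agent installed.

```python
import json
import subprocess
import sys
from typing import Iterator

def stream_json_lines(cmd: list[str]) -> Iterator[dict]:
    """Spawn `cmd` and yield one parsed JSON object per non-empty stdout line."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)

# Stand-in child process (assumption): prints two JSON lines, like a tiny event stream.
child = [
    sys.executable, "-c",
    'import json; print(json.dumps({"type": "assistant"})); print(json.dumps({"type": "result"}))',
]
events = list(stream_json_lines(child))
print([e["type"] for e in events])
```

Reading line by line, rather than waiting for the process to exit, is what lets the caller show progress while the agent is still working.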
# Run agent non-interactively, stream JSON lines to stdout
claude -p "Check errors in src/ and list affected files" \
  --output-format stream-json --verbose \
  | jq '.message'  # parse with jq
LLM Pricing Calculator
Calculate the real cost of your Agent SDK pipeline. Adjust task volume, model mix, and token usage to estimate monthly cost.
Minor Features
Features rarely needed in basic projects but valuable in production.
Observability & Monitoring
- OpenTelemetry: Integrate trace/span to track agents in monitoring systems (Datadog, Jaeger).
- Cost Tracking: View per-request token cost via ResultMessage.usage.
- Todo Tracking: Agent auto-creates a todo list to stay on track in long tasks (via the TodoWrite tool).
File & State Management
- File Checkpointing: Automatically snapshots files before Edit. Can rewind (undo) if agent breaks something.
- Session Store: Saves conversation transcript for resume after crash or step-by-step debugging.
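The snapshot-then-rewind idea behind file checkpointing can be illustrated with a toy class. This is a conceptual sketch only: `Checkpointer` and its methods are hypothetical and say nothing about how the SDK stores its checkpoints.

```python
import tempfile
from pathlib import Path

class Checkpointer:
    """Keep in-memory snapshots of files so an edit can be undone."""

    def __init__(self) -> None:
        self._snapshots: dict[Path, list[str]] = {}

    def edit(self, path: Path, new_text: str) -> None:
        # Snapshot the current contents before overwriting them.
        self._snapshots.setdefault(path, []).append(path.read_text())
        path.write_text(new_text)

    def rewind(self, path: Path) -> bool:
        """Restore the most recent snapshot; return False if there is none."""
        history = self._snapshots.get(path)
        if not history:
            return False
        path.write_text(history.pop())
        return True

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "auth.py"
    target.write_text("def check(): return True\n")
    cp = Checkpointer()
    cp.edit(target, "def check(): return False\n")
    edited = target.read_text()
    rewound = cp.rewind(target)
    restored = target.read_text()
    nothing_left = cp.rewind(target)

print(restored)
```

Because every edit pushes a snapshot onto a per-file stack, repeated rewinds walk back through edits in reverse order until the stack is empty.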
Advanced Customization
- Slash Commands: Define shortcuts like /reset, /status to control the agent mid-session.
- Skills: Package sets of instructions/tools into a reusable "skill" across multiple projects.
- Plugins: Extension point for platform-level customization.
- Subagents: Main agent spawns child agents that run in parallel (parallelism).
- Tool Search: When you have hundreds of tools, the agent auto-finds the right ones instead of loading all of them.
- Effort Level: Adjust effort from low to xhigh (for Opus models).
- Strict MCP Config: Lock the exact list of MCP servers the agent can use, preventing unintended connections.
Multi-cloud Support
- Amazon Bedrock: Use CLAUDE_CODE_USE_BEDROCK=1 to route through AWS instead of Anthropic directly.
- Google Vertex AI: Use CLAUDE_CODE_USE_VERTEX=1 for Google Cloud.
- Microsoft Azure AI Foundry: Use CLAUDE_CODE_USE_FOUNDRY=1 for Azure.
Deploy & Billing
What you need to know when taking Agent SDK to production.
Billing Notice (from 15/6/2026)
Important change: From 15/6/2026, Agent SDK and claude -p will use a separate credit pool instead of sharing the subscription pool. Programmatic usage is no longer subsidized as before.
| Usage type | Pool | Notes |
|---|---|---|
| Interactive chat (claude.ai) | Subscription pool | Unchanged from before |
| Agent SDK / claude -p | Agent SDK credit pool | Fixed monthly credit, then pay API rate |
| GitHub Actions | Agent SDK credit pool | Same as above |
Production best practices
Always add a pre-tool hook blocking dangerous commands when deploying on production servers.
Set max_turns to avoid infinite loops that burn money: 5-10 for simple tasks, 20-30 for complex ones.
Log message.usage from ResultMessage to track cost per task.
When deploying, lock the MCP server list. Prevents agents from connecting to tools outside the whitelist.
Managed Agents: If you don't want to manage infrastructure yourself, Anthropic also offers Managed Agents — a hosted REST API where Anthropic handles all sandbox and execution. Suitable for rapid scaling.
Optimal Architecture — Multi-Skill, Multi-Document
When a project has dozens of Skills and thousands of documents, you can't load everything into context at once. This architecture describes how to intelligently layer things so the agent works accurately with minimal token cost.
Core problem: 500 skills + 10,000 documents = millions of tokens if loaded in full. The 200k token context window fills immediately. Solution: only load what's needed, when needed — via embedding search + tool search.
4-Layer Architecture Map
Core Architecture Principles
tool_search to find top-k via embedding. Saves 90% context vs load-all.
Scan · Embed · Store — How Agents Read Large Data
When you have thousands of documents and skills, you need a data preparation pipeline before the agent runs. Here's the 3-step process.
1. Scan — Crawl and Chunk
Read + Grep + Glob directly. Higher accuracy than semantic, easier to debug. Use for exact match or codebase search.
Anthropic recommends: Start with agentic search. Only add semantic search when you need speed, or when the corpus is too large to Read files one by one.
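The "chunk" half of this step, for the semantic-search route, can be sketched as a sliding window over tokens. `chunk_tokens` is a hypothetical helper, the defaults mirror the 500-token chunks with 50-token overlap used in the embedding step, and whitespace-separated words stand in for real tokenizer output.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Assumption: words stand in for tokens; a real pipeline would use the embedding model's tokenizer.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words)
print(len(chunks), [len(c) for c in chunks])
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the price of embedding some tokens twice.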
2. Embed — Convert to Vectors
import voyageai
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

# 1. Chunk documents (500 tokens/chunk, overlap 50)
# chunk_documents and docs are placeholders for your own chunker and corpus
chunks = chunk_documents(docs, size=500, overlap=50)

# 2. Embed with Voyage AI (Anthropic recommended)
doc_embeddings = vo.embed(
    [c.text for c in chunks],
    model="voyage-4", input_type="document"
).embeddings

# 3. Store in Qdrant (the project_docs collection must already exist)
qdrant.upload_points(
    collection_name="project_docs",
    points=[
        PointStruct(id=i, vector=emb,
                    payload={"text": chunks[i].text, "source": chunks[i].source})
        for i, emb in enumerate(doc_embeddings)
    ],
)

3. Memory Store Layers
| Memory Type | Used For | Technology | Token Cost |
|---|---|---|---|
| Short-term | Current context window | SDK context + compaction | High — pay per token |
| Vector store | Document chunks, skill embeddings | Qdrant, pgvector | Low — query-time only |
| Graph store | Entities, relations, cross-doc links | cognee, Neo4j | Very low — traversal |
| File memory | MEMORY.md, progress notes | File system | ~0 — plain text read |
| Session store | Transcript, resume state | SQLite | 0 — offline |
Tool Search for Skills — 90% context savings
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # local, free

# Offline: embed all skill descriptions once (all_skills is your own list of skill objects)
skill_texts = [f"{s.name}: {s.description}" for s in all_skills]
# Unit-length vectors make the dot product below equal to cosine similarity
skill_vectors = model.encode(skill_texts, normalize_embeddings=True)  # [N_skills, 384]

def search_skills(query: str, top_k: int = 5, min_score: float = 0.70):
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = np.dot(skill_vectors, q_vec)  # cosine similarity
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [all_skills[i] for i in top_idx if scores[i] >= min_score]

Agent Scoring & Predefined Rubric
How do you know if the agent answered well? A rubric is a pre-defined set of rules — the agent "takes the test" against the rubric, it doesn't grade itself.
What is a Rubric?
Just like a teacher grades papers against an answer key, a Rubric is a set of criteria written in advance by domain experts. The agent doesn't know the rubric exists — it just does the work, while the evaluator grades using the rubric.
Without a rubric:
- Agent self-grades its own output → bias
- Different score every run → inconsistent
- Don't know exactly which part is weak to improve
- No objective threshold to decide retry vs escalate
With a rubric:
- Third party (evaluator) grades using fixed criteria
- 100% consistent — same output always gives same score
- Knows exactly which dimension failed (relevance? accuracy?)
- Automatic: PASS → deliver, RETRY → re-run, ESCALATE → human
Scoring Pipeline Architecture
Rubric YAML — Defining the Scale
# Rubric for data analysis task
task_type: data_analysis
version: "2.1"
dimensions:
  relevance:
    weight: 0.35
    description: "Does the result actually answer the question?"
    scoring:
      1.0: "Answers directly, correct metric asked for"
      0.7: "Answered but missing some dimensions"
      0.4: "Partially off-topic"
  accuracy:
    weight: 0.30
    validators:
      - type: cross_check_sql
      - type: range_sanity  # revenue cannot be negative
  completeness:
    weight: 0.20
    required_fields: [summary, kpis, anomalies, recommendation]
  cost_efficiency:
    weight: 0.15
    max_tokens_per_task: 50000
    penalty_over_budget: 0.5
thresholds:
  pass: 0.75
  warn: 0.60
  retry: 0.45
  escalate: 0.30
retry_strategy:
  max_retries: 2
  on_fail_dimension: relevance  # re-prompt focused on failing dimension

Scoring Methods — Implementation Tabs
Each rubric dimension is scored using a different method. No single method is "best" — what matters is matching the dimension's nature.
Relevance can't be checked with pure code — it needs semantic understanding. We use Haiku (cheap, fast) as a "sub-judge" — prompt it to score against rubric criteria.
import anthropic

async def score_relevance(question: str, agent_output: str, criteria: dict) -> float:
    # Async client, so the call does not block the event loop
    client = anthropic.AsyncAnthropic()
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""
Original question: {question}
Agent output:
{agent_output[:2000]}
Score Relevance (0.0 - 1.0) on this scale:
1.0 = {criteria['scoring'][1.0]}
0.7 = {criteria['scoring'][0.7]}
0.4 = {criteria['scoring'][0.4]}
0.0 = Completely irrelevant
Return only a single decimal number, no explanation.
"""
        }]
    )
    try:
        score = float(response.content[0].text.strip())
        return max(0.0, min(1.0, score))  # clamp to the valid range
    except ValueError:
        return 0.5  # fallback on parse error

Cost: Haiku ~$0.00025/call here. At 500 tasks/day, relevance scoring costs only ~$0.12/day.
Accuracy uses pure code — no AI needed. Results are 100% objective, runs in ms instead of seconds, no token cost.
def validate_range_sanity(output: dict) -> float:
    # Share of monetary KPIs that are non-negative (1.0 if none are present)
    violations = 0
    checks = 0
    for key in ["revenue", "gmv", "arr"]:
        if key in output.get("kpis", {}):
            checks += 1
            if output["kpis"][key] < 0:
                violations += 1
    if checks == 0:
        return 1.0
    return 1.0 - (violations / checks)

def validate_cross_check_sql(output: dict, source_db) -> float:
    # source_db is a placeholder for a DB client whose query() returns rows
    try:
        sql_result = source_db.query(
            "SELECT SUM(revenue) FROM orders WHERE date >= %s",
            [output["period_start"]]
        )
        expected = sql_result[0][0]
        reported = output["kpis"]["revenue"]
        if not expected:
            # SUM is NULL or 0: avoid dividing by zero
            return 1.0 if reported == 0 else 0.2
        diff_pct = abs(reported - expected) / abs(expected)
        if diff_pct < 0.001:
            return 1.0
        elif diff_pct < 0.01:
            return 0.85
        elif diff_pct < 0.05:
            return 0.6
        else:
            return 0.2
    except Exception:
        return 0.5  # could not verify either way

Completeness simply checks: are the required fields present in the output? No AI, no SQL, only a dictionary check.
def score_completeness(output: dict, required_fields: list) -> float:
    if not required_fields:
        return 1.0  # nothing required, nothing missing
    found = 0
    for field in required_fields:
        val = output
        try:
            for p in field.split("."):  # dotted paths address nested keys
                val = val[p]
        except (KeyError, TypeError):
            continue  # missing key, or a non-dict value along the path
        if val is not None and val != "" and val != []:
            found += 1
    return found / len(required_fields)

Cost Efficiency measures whether the agent consumed tokens within the allowed budget. Taken from ResultMessage.usage in the SDK.
def score_cost_efficiency(result_message, max_tokens: int, penalty: float = 0.5) -> float:
    usage = result_message.usage or {}  # token counts, a dict in the Python SDK
    total = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
    if total <= max_tokens:
        # Within budget: lose at most 0.15 as usage approaches the limit
        usage_pct = total / max_tokens
        return 1.0 - (usage_pct * 0.15)
    else:
        # Over budget: start from the penalty score and keep dropping, floor at 0.1
        overrun_pct = (total - max_tokens) / max_tokens
        base_score = 1.0 * penalty
        return max(0.1, base_score - overrun_pct * 0.2)

Combine all dimensions into a single score via weighted sum. Then compare against thresholds to produce a verdict.
import yaml
from dataclasses import dataclass

@dataclass
class RubricResult:
    scores: dict
    final: float
    verdict: str
    failed: list

# score_accuracy is a placeholder that runs the accuracy validators and combines their scores
async def evaluate_with_rubric(question, agent_output, result_msg, rubric_path) -> RubricResult:
    with open(rubric_path) as f:
        rubric = yaml.safe_load(f)
    dims = rubric["dimensions"]
    cost = dims["cost_efficiency"]
    scores = {}
    scores["relevance"] = await score_relevance(question, str(agent_output), dims["relevance"])
    scores["accuracy"] = score_accuracy(agent_output, dims["accuracy"]["validators"])
    scores["completeness"] = score_completeness(agent_output, dims["completeness"]["required_fields"])
    scores["cost_efficiency"] = score_cost_efficiency(result_msg, cost["max_tokens_per_task"], cost["penalty_over_budget"])
    final = sum(scores[d] * dims[d]["weight"] for d in scores)
    t = rubric["thresholds"]
    verdict = ("PASS" if final >= t["pass"] else
               "WARN" if final >= t["warn"] else
               "RETRY" if final >= t["retry"] else "ESCALATE")
    failed = [d for d, s in scores.items() if s < 0.6]
    return RubricResult(scores, final, verdict, failed)

Interactive Scoring Calculator
Enter scores for each dimension to see how the weighted sum and verdict work in real time.
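As a static stand-in for the calculator, the sketch below works two sets of scores through the weighted sum using the weights and thresholds from the rubric YAML. The function name `verdict` and the two example score sets are illustrative, not part of the pipeline.

```python
# Weights and thresholds copied from the rubric YAML.
WEIGHTS = {"relevance": 0.35, "accuracy": 0.30, "completeness": 0.20, "cost_efficiency": 0.15}
THRESHOLDS = {"pass": 0.75, "warn": 0.60, "retry": 0.45}

def verdict(scores: dict[str, float]) -> tuple[float, str]:
    """Weighted sum of dimension scores, then the highest threshold it clears."""
    final = sum(scores[dim] * weight for dim, weight in WEIGHTS.items())
    if final >= THRESHOLDS["pass"]:
        label = "PASS"
    elif final >= THRESHOLDS["warn"]:
        label = "WARN"
    elif final >= THRESHOLDS["retry"]:
        label = "RETRY"
    else:
        label = "ESCALATE"
    return round(final, 4), label

# Hypothetical score sets for illustration.
good = {"relevance": 0.7, "accuracy": 0.85, "completeness": 1.0, "cost_efficiency": 0.9}
weak = {"relevance": 0.4, "accuracy": 0.6, "completeness": 0.5, "cost_efficiency": 0.9}

print(verdict(good))  # 0.245 + 0.255 + 0.200 + 0.135
print(verdict(weak))  # 0.140 + 0.180 + 0.100 + 0.135
```

The first set sums to 0.835 and passes; the second sums to 0.555 and lands in the retry band. Because relevance carries the largest weight, a weak relevance score pulls the total down fastest. In this chain of comparisons anything below the retry threshold escalates, so the rubric's separate escalate value of 0.30 is never consulted.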
Radar Chart — Score by Dimension
Console Demo — Full End-to-End Run
Watch the entire pipeline run in a terminal: from receiving the prompt, scanning documents, selecting skills, spawning subagents, to scoring the result.
Step 1 — Setup & Index Data
Step 2 — Run Pipeline with Real Prompt
Cost Optimization — Strategy Comparison
| Strategy | Tokens used | Cost (est.) | Accuracy |
|---|---|---|---|
| Load-all skills + docs | ~2,000,000+ | ~$6.00 | High but wasteful |
| Tool Search + Vector Retrieval | ~70,000 | ~$0.18 | Nearly equivalent |
| All-Opus every task | 70,000 | ~$2.10 | Highest |
| Model routing (Haiku+Sonnet) | 70,000 | ~$0.18 | Nearly equivalent |
| Routing + Prompt caching | 70,000 (40% cached) | ~$0.11 | Same as routing |
Rule of thumb: Haiku ($) for filter/classify, Sonnet ($$) for analyze/write, Opus ($$$) only when deep reasoning is truly needed. Add prompt caching for another 30-40% savings.
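The rule of thumb translates directly into a routing table. In this sketch the task-kind labels come from the sentence above, while `ROUTES`, `pick_model`, `plan`, and the choice to default unknown kinds to the middle tier are assumptions; tier names are used instead of model IDs so nothing here depends on a specific model version.

```python
# Task kind -> model tier, following the rule of thumb.
ROUTES = {
    "filter": "haiku",
    "classify": "haiku",
    "analyze": "sonnet",
    "write": "sonnet",
    "deep_reasoning": "opus",
}

def pick_model(task_kind: str) -> str:
    """Return the tier for a task kind; unknown kinds default to the middle tier (assumption)."""
    return ROUTES.get(task_kind, "sonnet")

def plan(tasks: list[str]) -> dict[str, int]:
    """Count how many tasks land on each tier, which is useful for estimating the cost mix."""
    counts: dict[str, int] = {}
    for kind in tasks:
        tier = pick_model(kind)
        counts[tier] = counts.get(tier, 0) + 1
    return counts

batch = ["classify", "classify", "filter", "analyze", "write", "deep_reasoning"]
print(plan(batch))
```

Keeping the routing in a plain table makes the cost policy reviewable at a glance, and moving a task kind to a cheaper tier becomes a one-line change.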
Fact-Check Report — Claim-by-Claim Validation
All claims in this document were independently searched and verified from official sources (Anthropic docs, PyPI, GitHub, pricing pages). Checked: 14/05/2026.
1. SDK Identity & Package Names
- claude-code-sdk on PyPI is DEPRECATED, redirects to claude-agent-sdk.
- pip install claude-agent-sdk & npm install @anthropic-ai/claude-agent-sdk
- asyncio.run()
- anyio.run(main). Both work, but anyio is the canonical approach.
2. Core API & Parameters
- ClaudeCodeOptions → ClaudeAgentOptions
- system_prompt, max_turns, allowed_tools, disallowed_tools
- AssistantMessage, ToolUseMessage, ResultMessage, SystemMessage
- SDKCompactBoundaryMessage for auto-compaction. Doc doesn't mention this edge case.
- query() vs ClaudeSDKClient — query for single, Client for multi-turn + hooks + custom tools
3. Built-in Tools
- permission_mode="acceptEdits" for automation
- bypassPermissions also exists for full automation.
4. Custom Tools & MCP
- @tool + create_sdk_mcp_server() for in-process MCP
- mcp_servers=[{"name": "github", "url": "..."}]
- mcp_servers={"server_name": server_config}. List format is from an older API version.
- mcp__server_name__tool_name
5. Pricing & Model Names
- $0.80/$4.00, Sonnet $3/$15, Opus $15/$75
- claude -p and Agent SDK use separate programmatic credit pool.
6. Architecture Claims
- voyage-4 as current generation.
Summary
Action items:
(1) Update pricing calculator: Haiku = $1/$5, Opus = $5/$25 (not $15/$75).
(2) MCP servers config format: use dict {"name": config}, not a list.
(3) strict_mcp_config: verify with current SDK version before using.
(4) GitHub MCP URL in code examples is illustrative — not a real working URL.