AI is the team, not a chatbot button
Prompt caching that drops costs by an order of magnitude, tool use that turns Claude into an operator, MCP servers that connect the model to your stack, and the audit trail that lets you ship it past compliance. The operating system, not the marketing slide.
Why this stack
Prompt caching is the unlock most teams miss
A stable system prompt and a long reference document cache after the first request. Subsequent requests pay roughly ten percent of the input-token cost. We design the prompt structure so caching applies on every meaningful call, which moves a Claude integration from "expensive feature" to "always on".
Tool use turns Claude into a teammate
A tool is a typed function the model can call. We define tools as TypeScript schemas, the model picks one, our handler runs it, and the result goes back into the conversation. The pattern works for reading the database, running a search, calling a Stripe operation, or invoking another agent.
MCP servers connect the model to your stack
Model Context Protocol is how an agent talks to your application code without REST scaffolding. We have shipped MCP servers for Design System Ops, Dev Ops and Social Ops; the same pattern applies to your internal tools. The model uses your code, your code does not parse the model.
Streaming, but the right way
SSE-based streaming with proper backpressure handling, partial-message rendering, error boundaries that do not collapse the UI when the model errors mid-stream, and structured event parsing so a tool call inside a stream is not a parse hack.
Cost accounting and governance from day one
Every Claude request is logged with model, input tokens, output tokens, cache hit ratio and cost. Per-tenant rate limits, per-user budgets, and a finance dashboard that does not require an export-to-spreadsheet step.
What we build with it
Typed Anthropic SDK integration
TypeScript client with retries, rate-limit-aware backoff, structured error handling, model selection per task (Opus / Sonnet / Haiku).
Prompt caching architecture
Cache breakpoints placed on the stable parts (system prompt, reference docs, long examples), measurable cache hit ratio in the dashboard.
Tool use framework
Tools defined as TypeScript schemas with Zod validation. The handler chain is transactional with the database so a model-triggered write rolls back on validation failure.
MCP server for your stack
A Model Context Protocol server that exposes your application capabilities to Claude (read DB, write back, call APIs, trigger workflows). Versioned, governed, audit-logged.
Streaming UI
Server-Sent Events transport, partial-message rendering, tool-call inline rendering, recoverable error boundaries.
Agent orchestration
Multi-step agents that plan, execute tools, observe results and decide what to do next. Bounded loops with a maximum step count and explicit exit conditions.
Conversation persistence
Database-backed conversation store with summarisation for long threads, token-budget-aware context window management.
Token accounting + cost reporting
Per-request logging of model, tokens, cost, cache hit ratio. Per-tenant aggregates, per-user budgets, monthly reporting wired to the admin dashboard.
Prompt injection mitigation
System-prompt isolation, untrusted-input fencing, output validation against expected schema, refusal handling for adversarial instructions.
Output validation pipeline
Structured outputs validated against Zod schemas, re-prompting on parse failure, dead-letter logging for irrecoverable cases.
Model selection policy
Haiku for high-volume cheap tasks, Sonnet for the standard tier, Opus for hard reasoning. The policy lives in code, not in someone's head.
Fallback strategy
Graceful degradation when Anthropic is unavailable, retry with backoff, optional cross-provider fallback where the use case allows it.
A typed Claude call with prompt caching, tool use and streaming
One function does the whole flow. The system prompt and a reference document are cached on the first request. The model can call a `lookupCustomer` tool whose result feeds back into the conversation. The response streams to the client over SSE.
Most "Claude integrations" wire a chatbot input to the API and call it a day. Production AI looks different. The function below is what we ship: a typed call to messages.create, system prompt and reference document cached, a lookupCustomer tool that the model can call, the response streamed back to the client over Server-Sent Events.
1. The typed client
The Anthropic SDK is wrapped in a thin client that adds retries, error mapping and a typed model union. Model selection happens at the call site; the client does not pick.
// src/lib/claude/client.ts
import Anthropic from '@anthropic-ai/sdk'
export const claude = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY!,
maxRetries: 3,
})
export type ClaudeModel =
| 'claude-opus-4-7'
| 'claude-sonnet-4-6'
| 'claude-haiku-4-5'
export const MODEL_BY_TASK: Record<'reasoning' | 'standard' | 'cheap', ClaudeModel> = {
reasoning: 'claude-opus-4-7',
standard: 'claude-sonnet-4-6',
cheap: 'claude-haiku-4-5',
}
2. The tool definition
A tool is a typed function with a JSON schema. Zod produces the schema; the handler runs in the application's normal transactional flow. The model never touches the database directly.
// src/lib/claude/tools/lookup-customer.ts
import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'
import { adminClient } from '@/lib/supabase/admin'
export const lookupCustomerInput = z.object({
email: z.string().email(),
})
export const lookupCustomerTool = {
name: 'lookupCustomer',
description:
'Find a customer by email. Returns the canonical record or null.',
input_schema: zodToJsonSchema(lookupCustomerInput, {
target: 'jsonSchema7',
}) as Record<string, unknown>,
} as const
export async function runLookupCustomer(
input: z.infer<typeof lookupCustomerInput>,
): Promise<unknown> {
const { data, error } = await adminClient
.from('customers')
.select('id, name, plan, status, created_at')
.eq('email', input.email)
.maybeSingle()
if (error) throw error
return data ?? null
}
3. The streaming call with prompt caching
The system prompt and the reference document both carry cache_control. After the first request, subsequent calls pay the cache-read rate (roughly ten percent of standard input) on those blocks. The conversation messages themselves stay uncached because they change every turn.
// src/lib/claude/run.ts
import { claude, MODEL_BY_TASK } from './client'
import { lookupCustomerTool, runLookupCustomer, lookupCustomerInput } from './tools/lookup-customer'
import { systemPrompt } from './prompts/system'
import { referenceDoc } from './prompts/reference'
interface RunInput {
conversation: { role: 'user' | 'assistant'; content: string }[]
tenantId: string
}
export async function* runClaude(input: RunInput) {
const stream = await claude.messages.create({
model: MODEL_BY_TASK.standard,
max_tokens: 4096,
system: [
{
type: 'text',
text: systemPrompt,
cache_control: { type: 'ephemeral' },
},
{
type: 'text',
text: referenceDoc,
cache_control: { type: 'ephemeral' },
},
],
messages: input.conversation,
tools: [lookupCustomerTool],
stream: true,
})
for await (const chunk of stream) {
yield chunk
// When the model decides to call a tool, the stream emits a
// content_block_stop with the tool_use payload. We execute the tool and
// continue the conversation with the result.
if (
chunk.type === 'content_block_stop' &&
'content_block' in chunk &&
chunk.content_block?.type === 'tool_use' &&
chunk.content_block.name === 'lookupCustomer'
) {
const parsed = lookupCustomerInput.parse(chunk.content_block.input)
const result = await runLookupCustomer(parsed)
yield {
type: 'tool_result' as const,
tool_use_id: chunk.content_block.id,
content: JSON.stringify(result),
}
}
}
}
4. The Server-Sent Events route
The streaming generator is piped through a Server-Sent Events transport so the browser receives partial tokens as they arrive. The route handler also writes one row to the request log per call, so cost accounting and audit are not optional.
// app/api/chat/route.ts
import { runClaude } from '@/lib/claude/run'
import { logClaudeRequest } from '@/lib/claude/logging'
import { getServerSession } from '@/lib/auth/server'
export const runtime = 'nodejs'
export async function POST(request: Request) {
const session = await getServerSession()
if (!session) return new Response('unauthorised', { status: 401 })
const { conversation } = await request.json()
const start = Date.now()
const stream = new ReadableStream({
async start(controller) {
let inputTokens = 0
let outputTokens = 0
let cacheHits = 0
try {
for await (const chunk of runClaude({
conversation,
tenantId: session.tenantId,
})) {
if (chunk.type === 'message_start') {
inputTokens = chunk.message.usage.input_tokens
cacheHits = chunk.message.usage.cache_read_input_tokens ?? 0
}
if (chunk.type === 'message_delta') {
outputTokens = chunk.usage.output_tokens
}
controller.enqueue(
new TextEncoder().encode(`data: ${JSON.stringify(chunk)}\n\n`),
)
}
} finally {
controller.close()
await logClaudeRequest({
tenantId: session.tenantId,
userId: session.userId,
model: 'claude-sonnet-4-6',
inputTokens,
outputTokens,
cacheHits,
durationMs: Date.now() - start,
})
}
},
})
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
},
})
}
5. What this buys you
Prompt caching turns a five-cent call into a half-cent call. Tool use turns the model from a chat surface into an operator that reads your database, calls your APIs, runs your workflows. The audit log gives finance a number to plan around and security a record to sign off on. The whole thing fits in four files because the SDK is well designed and the application stops fighting it.
This is what "AI is the team" means in code, not in the marketing slide.
Frequently asked questions
Why Claude over OpenAI or Google?
For coding, structured reasoning and tool-heavy agents, Claude consistently wins our internal benchmarks at the price point. The Anthropic SDK has the most mature support for prompt caching and tool use today. For specific use cases (image gen, audio) we use the right tool for the job; Claude is the default.
How much can prompt caching actually save?
On a typical agent workflow with a 5,000-token system prompt and a 20,000-token reference doc, we see input cost drop by roughly 80 to 90 percent once caching warms up. The exact number depends on cache hit ratio and message turn structure; we measure it per project and report it monthly.
What is the difference between tool use and function calling?
Same idea, different wording. Anthropic calls it tool use, OpenAI calls it function calling. The patterns are interchangeable conceptually; we use Anthropic's structured tool format which gives us typed parameters via JSON Schema and clean stop-reason handling.
Do you build MCP servers?
Yes, we have shipped them for our own internal Ops teams (Design System Ops, Dev Ops, Social Ops). For client projects the pattern is the same: pick the operations Claude needs to call, expose them via MCP, version the schema, governance and audit.
How do you handle cost optimisation?
Three levers, in order. First, prompt caching: the biggest single win, on every project. Second, model selection: Haiku for cheap tasks, Sonnet for the standard tier, Opus for the hard cases. Third, context trimming: summarise old turns, drop stale tool results, keep the working set within budget.
What about safety and prompt injection?
System-prompt isolation by Anthropic's API design, untrusted-input fencing in code (we never concatenate user input into the system prompt), output validation against a Zod schema, and refusal handling for adversarial instructions. Logged and auditable for every request.
Anthropic API direct, Bedrock, or Vertex?
Direct Anthropic API for most projects (fastest model release cadence, full feature parity). Bedrock when the buyer requires AWS data residency, Vertex when the buyer is on Google Cloud. The application code stays portable; only the client construction changes.
Tell us what you are building with Claude
A scoping call, a concrete number in the first reply, no agency theater. A production Claude integration in two to four weeks.