This lesson includes a graded coding exercise that runs in your browser, unlocked with lifetime access.
Most AI startups do not die from bad models. They die from bad unit economics. A single GPT-4o call costs fractions of a cent. Ten thousand users making ten calls per day costs
50 in input tokens alone -- before you charge a single dollar. The companies that survive are the ones that treat every API call as a financial transaction, not a function call.
Implement semantic caching that serves repeated or similar queries from cache instead of making a new API call
Calculate per-request costs across providers and implement token-aware rate limiting and budget alerts
Build a cost optimization layer with prompt compression, model routing (expensive vs cheap), and response caching
Design a tiered caching strategy using exact match, semantic similarity, and prefix caching for different query types
The Problem
You build a RAG chatbot. It works beautifully. Users love it.
Then the invoice arrives.
GPT-5 costs $5 per million input tokens and
5 per million output. Claude Opus 4.7 costs
5 input / $75 output. Gemini 3 Pro costs
.25 input / $5 output. GPT-5-mini is $0.25/
. Prices below are illustrative; always check the provider's current pricing page.
Here is the math that kills startups:
10,000 daily active users
10 queries per user per day
1,000 input tokens per query (system prompt + context + user message)
500 output tokens per response
Daily input cost: 10,000 x 10 x 1,000 / 1,000,000 x
.50 =
50/day
Daily output cost: 10,000 x 10 x 500 / 1,000,000 x
0.00 = $500/dayMonthly total:
2,500/month
That is just the LLM. Add embeddings, vector database hosting, infrastructure. You are looking at $30,000/month for a chatbot.
The brutal part: 40-60% of those queries are near-duplicates. Users ask the same questions in slightly different words. Your system prompt -- identical across every request -- gets billed every single time. Context documents retrieved by RAG repeat across users who ask about the same topic.
You are paying full price for redundant computation.
The Concept
The Cost Anatomy of an LLM Call
Every API call has five cost components.
graph LR
A[User Query] --> B[System Prompt<br/>500-2000 tokens]
A --> C[Retrieved Context<br/>500-4000 tokens]
A --> D[User Message<br/>50-500 tokens]
B --> E[Input Cost<br/>
.50/1M tokens]
C --> E
D --> E
E --> F[Model Processing]
F --> G[Output Cost<br/>
0.00/1M tokens]
System prompts are the silent killer. A 1,500-token system prompt sent with every request costs $3.75 per million requests just for that prefix. At 100K requests per day, that is $375/day --
1,250/month -- for text that never changes.
Provider Caching: Built-in Discounts
All three major providers offer provider-side prompt caching in 2026, but the mechanics differ. See Phase 11 · 15 for the deep dive.
Provider
Mechanism
Discount
Minimum
Cache Duration
Anthropic
Explicit cache_control markers
90% on cache hits (pay 25% extra on write)
1,024 tokens (Sonnet/Opus), 2,048 (Haiku)
5 min default; 1h extended (2x write premium)
OpenAI
Automatic prefix matching
50% on cache hits
1,024 tokens
Best-effort up to 1 hour
Google Gemini
Explicit CachedContent API
~75% reduction (plus storage)
4,096 (Flash) / 32,768 (Pro)
User-configurable TTL
Anthropic's approach is explicit. You mark sections of your prompt with cache_control: {"type": "ephemeral"}. The first request pays a 25% write premium. Subsequent requests with the same prefix get a 90% discount. A 2,000-token system prompt that costs $0.005 normally costs $0.000625 on cache hits. Over 100K requests, that saves $437.50/day.
OpenAI's approach is automatic. Any prompt prefix that matches a previous request gets a 50% discount. No markers needed. The tradeoff: less discount, less control, but zero implementation effort.
Semantic Caching: Your Custom Layer
Provider caching only works for identical prefixes. Semantic caching handles the harder case: different queries with the same meaning.
"What is the return policy?" and "How do I return an item?" are different strings but identical intent. A semantic cache embeds both queries, computes cosine similarity, and returns the cached response if similarity exceeds a threshold (typically 0.92-0.95).
flowchart TD
A[User Query] --> B[Embed Query]
B --> C{Similar query<br/>in cache?}
C -->|sim > 0.95| D[Return Cached Response]
C -->|sim < 0.95| E[Call LLM API]
E --> F[Cache Response<br/>with Embedding]
F --> G[Return Response]
D --> G
The embedding costs are negligible. OpenAI's text-embedding-3-small costs $0.02 per million tokens. Checking the cache costs almost nothing compared to a full LLM call.
Exact Caching: Hash and Match
For deterministic calls (temperature=0, same model, same prompt), exact caching is simpler and faster. Hash the full prompt, check the cache, return if found.
This works perfectly for:
System prompt + fixed context + identical user queries
Function calling with identical tool definitions
Batch processing where the same document gets processed multiple times
Rate Limiting: Protecting Your Budget
Rate limiting is not just about fairness. It is about survival.
Token bucket algorithm: each user gets a bucket of N tokens that refills at rate R per second. A request consumes tokens from the bucket. If the bucket is empty, the request is rejected. This allows bursts (use the full bucket at once) while enforcing an average rate.
Per-user quotas: set daily/monthly token limits per user tier.
Tier
Daily Token Limit
Max Requests/min
Model Access
Free
50,000
10
GPT-4o-mini only
Pro
500,000
60
GPT-4o, Claude Sonnet
Enterprise
5,000,000
300
All models
Model Routing: Right Model for the Right Job
Not every query needs GPT-4o.
"What time does the store close?" does not require a
0/M-output model. GPT-4o-mini at $0.60/M output handles it perfectly. Claude Haiku at
.25/M output handles it. A simple classifier routes cheap queries to cheap models and complex queries to expensive models.
flowchart TD
A[User Query] --> B[Complexity Classifier]
B -->|Simple: lookup, FAQ| C[GPT-4o-mini<br/>$0.15/$0.60 per 1M]
B -->|Medium: analysis, summary| D[Claude Sonnet<br/>$3.00/
5.00 per 1M]
B -->|Complex: reasoning, code| E[GPT-4o / Claude Opus<br/>
.50/
0.00+]
A well-tuned router saves 40-70% on model costs alone.
Cost Tracking: Know Where the Money Goes
You cannot optimize what you do not measure. Log every API call with:
Timestamp
Model name
Input tokens
Output tokens
Latency (ms)
Computed cost ($)
User ID
Cache hit/miss
Request category
This data reveals which features are expensive, which users are heavy consumers, and where caching has the most impact.
Batching: Bulk Discounts
OpenAI's Batch API processes requests asynchronously at a 50% discount. You submit a batch of up to 50,000 requests, and results come back within 24 hours.
Use batching for:
Nightly document processing
Bulk classification
Evaluation runs
Data enrichment pipelines
Not for: real-time user-facing queries (latency matters).
Budget Alerts and Circuit Breakers
A circuit breaker stops spending when you hit a limit. Without one, a bug or abuse can burn through your monthly budget in hours.
Set three thresholds:
Warning (70% of budget): send an alert
Throttle (85% of budget): switch to cheaper models only
Stop (95% of budget): reject new requests, return cached responses only
The Optimization Stack
Apply these techniques in order. Each layer compounds on the previous ones.
Layer
Technique
Typical Savings
Implementation Effort
1
Provider prompt caching
30-50%
Low (add cache markers)
2
Exact caching
10-20%
Low (hash + dict)
3
Semantic caching
15-30%
Medium (embeddings + similarity)
4
Model routing
40-70%
Medium (classifier)
5
Rate limiting
Budget protection
Low (token bucket)
6
Prompt compression
10-30%
Medium (rewrite prompts)
7
Batching
50% on eligible
Low (batch API)
A RAG app applying layers 1-5 typically reduces costs from
2,500/month to $4,000-6,000/month. That is the difference between burning runway and building a business.
Real Savings: Before and After
Here is a real breakdown for a RAG chatbot serving 10,000 DAU.
Metric
Before Optimization
After Optimization
Savings
Monthly LLM cost
2,500
$5,200
77%
Avg cost per query
$0.0075
$0.0017
77%
Cache hit rate
0%
52%
--
Queries routed to mini
0%
65%
--
P95 latency
2,800ms
900ms (cache hits: 50ms)
68%
Monthly embedding cost
$0
80
(new cost)
Total monthly cost
2,500
$5,380
76%
The embedding cost for semantic caching (
80/month) pays for itself within the first hour of cache hits.
Build It
Step 1: Cost Calculator
Build a token cost calculator that knows current pricing for major models.
The first call writes to the cache (25% premium). Every subsequent call with the same system prompt prefix reads from the cache (90% discount). The cache lasts 5 minutes and resets the timer on every hit.
OpenAI Automatic Caching
# from openai import OpenAI
#
# client = OpenAI()
#
# response = client.chat.completions.create(
# model="gpt-4o",
# messages=[
# {"role": "system", "content": "You are a helpful customer support agent..."},
# {"role": "user", "content": "What is the return policy?"},
# ],
# )
#
# print(f"Prompt tokens: {response.usage.prompt_tokens}")
# print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
# print(f"Completion tokens: {response.usage.completion_tokens}")
OpenAI caches automatically. Any prompt prefix of 1,024+ tokens that matches a recent request gets a 50% discount. No code changes needed -- just check prompt_tokens_details.cached_tokens in the response to verify it is working.
OpenAI Batch API
# import json
# from openai import OpenAI
#
# client = OpenAI()
#
# requests = []
# for i, query in enumerate(queries):
# requests.append({
# "custom_id": f"request-{i}",
# "method": "POST",
# "url": "/v1/chat/completions",
# "body": {
# "model": "gpt-4o-mini",
# "messages": [{"role": "user", "content": query}],
# },
# })
#
# with open("batch_input.jsonl", "w") as f:
# for r in requests:
# f.write(json.dumps(r) + "\n")
#
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(input_file_id=batch_file.id, endpoint="/v1/chat/completions", completion_window="24h")
# print(f"Batch ID: {batch.id}, Status: {batch.status}")
Batch API gives a flat 50% discount on all tokens. Results arrive within 24 hours. Perfect for non-real-time workloads: evaluations, data labeling, bulk summarization.
In production, replace the linear scan with a vector index (Redis Vector Search, Pinecone, or pgvector). Linear scan works for <1,000 entries. Beyond that, use ANN (approximate nearest neighbor) for O(log n) lookup.
Ship It
This lesson produces outputs/prompt-cost-optimizer.md -- a reusable prompt that analyzes your LLM application and recommends specific cost optimizations with projected savings.
It also produces outputs/skill-cost-patterns.md -- a decision framework for choosing the right caching strategy, rate limiting configuration, and model routing rules for your use case.
Exercises
Implement LRU eviction for the semantic cache. Replace the oldest-first eviction with least-recently-used. Track the last access time for each entry and evict the entry with the oldest access time when the cache is full. Compare hit rates between the two strategies over 100 queries.
Build a cost projection tool. Given a log of API calls (the CostTracker logs), project the monthly cost based on the trailing 7-day average. Account for weekday/weekend patterns. Trigger an alert if the projected monthly cost exceeds the budget by more than 20%.
Implement tiered semantic caching. Use two similarity thresholds: 0.98 for high-confidence hits (return immediately) and 0.90 for medium-confidence hits (return with a disclaimer: "Based on a similar previous question..."). Track which tier each hit came from and measure user satisfaction differences.
Build a model routing classifier. Replace the keyword-based classifier with an embedding-based one. Embed 50 labeled queries (simple/medium/complex), then classify new queries by finding the nearest labeled example. Measure classification accuracy against a test set of 20 queries.
Implement a circuit breaker with degradation levels. At 70% budget, log a warning. At 85%, automatically switch all routing to the cheapest model (gpt-4o-mini). At 95%, serve only cached responses and reject new queries. Test by simulating 1,000 requests against a
.00 budget and verify each threshold triggers correctly.
Key Terms
Term
What people say
What it actually means
Prompt caching
"Cache the system prompt"
Provider-level caching where repeated prompt prefixes get a discount (90% Anthropic, 50% OpenAI) -- no code changes for OpenAI, explicit markers for Anthropic
Semantic caching
"Smart caching"
Embedding the query, computing similarity to past queries, and returning the cached response if similarity exceeds a threshold -- catches paraphrases that exact matching misses
Exact caching
"Hash caching"
Hashing the full prompt (model + messages + temperature) and returning the cached response for identical inputs -- only works for temperature=0 deterministic calls
Token bucket
"Rate limiter"
An algorithm where each user has a bucket of N tokens that refills at rate R per second -- allows bursts up to N while enforcing an average rate of R
Model routing
"Cheapskate routing"
Using a classifier to send simple queries to cheap models (GPT-4o-mini, Haiku) and complex queries to expensive models (GPT-4o, Opus) -- saves 40-70% on model costs
Cost tracking
"Metering"
Logging every API call with model, tokens, latency, cost, and user ID so you know exactly where money goes and which features are expensive
Circuit breaker
"Kill switch"
Automatically degrading service (cheaper models, cached-only) or stopping requests entirely when spending approaches the budget limit
Batch API
"Bulk discount"
OpenAI's asynchronous processing at 50% discount -- submit up to 50,000 requests, get results within 24 hours
Prompt compression
"Token diet"
Rewriting system prompts and context to use fewer tokens while preserving meaning -- shorter prompts cost less and often perform better
Cache hit rate
"Cache efficiency"
The percentage of requests served from cache instead of calling the LLM -- 40-60% is typical for production chatbots, saves proportionally on cost
Further Reading
Anthropic Prompt Caching Guide -- the official docs for Anthropic's explicit cache_control markers, pricing, and cache lifetime behavior
OpenAI Prompt Caching -- OpenAI's automatic caching, how to verify cache hits via usage fields, and minimum prefix lengths
OpenAI Batch API -- 50% discount for asynchronous processing, JSONL format, 24-hour completion window, and 50K request limits