,000/night. You hear about batch APIs.
The batch gets you 50% off. You also enable prompt caching on the system prompt (shared across all 50k calls). Stacked, the bill drops to
80/night — ~9% of baseline. Same pipeline, three config changes.
Batch is the cheapest lever in the LLM cost toolkit that nobody pulls. The reason is mostly organizational: teams think "real-time" when the SLA actually is "by morning." This lesson is about not leaving 90% of the bill on the table.
The Concept
The three batch APIs
OpenAI Batch API: JSONL file upload with a list of requests. Promised 24-hour turnaround (usually ~2-8 hours in practice). 50% discount on input and output tokens. /v1/batches endpoint. Cache-eligible inputs also get cached-input pricing on top.
Anthropic Message Batches: JSONL upload. 24-hour turnaround. 50% discount. Supports cache_control — cache writes are explicit, reads happen automatically within the batch.
Google Vertex AI Batch Prediction: BigQuery or GCS input. Similar 50% discount for Gemini. Integrates with Vertex pipelines.
Semantic: asynchronous, not slow
Batch is "I promise to return within 24 hours" — not "this will take 24 hours." Typical P50 is 2-6 hours. Provider schedules your batch during off-peak windows when GPU inventory is underutilized.
Stack with caching
A 50k-document summarization with the same 4K-token system prompt:
- Synchronous uncached: 50000 × ($input × 4000 + $output × 200) at full rates.
- Synchronous cached: system prompt cached after first write; remaining 49999 get 10x cheaper input.
- Batch cached: all of the above plus 50% discount on both read and write.
The stack: batch + cache = ~10% of sync uncached bill. Any workload that runs overnight and has a shared system prompt should use this.
Workload triage
Interactive — user waits for the response. TTFT matters. Synchronous call with prompt caching. Cannot batch.
Semi-interactive — user submits a task, checks back in minutes. Async queue with fallback to sync if batch not available. Think moderate-volume RAG indexing.
Batch — user expects results "by morning" or "next hour." Content pipelines, classification at scale, offline analysis. Always batch, always stack caching.
Common mistake: classifying everything as interactive because the pipeline is production. Production is not a latency spec — SLA is.
The partial-interactivity trap
Some features look interactive but tolerate 5-10 minutes. Example: a nightly customer health report with "refresh" button. User clicks refresh; wait 10 minutes is fine. Team ships it as synchronous. 50 concurrent refreshes cost 10x what batched-and-delivered-via-email would cost.
The question to ask: "What does 24-hour mean for this user?" If the answer is "they wouldn't notice," batch it.
The output-schema trap
Batch file formats differ per provider:
- OpenAI: JSONL, one request per line.
- Anthropic: JSONL, one message per line; response format embedded.
- Vertex: BigQuery table or GCS prefix with TFRecord.
Writing "one batch client" across providers means adapter code per provider. Gateways that advertise multi-provider batch (Portkey, LiteLLM some tiers) still thin-wrap the raw format.
Numbers you should remember
- Batch discount across providers: 50% flat on input + output.
- Turnaround SLA: 24 hours guaranteed, 2-6 hours typical P50.
- Stacked batch + cached input: ~10% of sync uncached cost.
- Workload triage rule: if 24h latency acceptable, always batch.
Use It
code/main.py computes costs across sync, sync+cache, batch, and batch+cache for a 50k-document workload. Reports savings in $ and percent.
Ship It
This lesson produces outputs/skill-batch-triager.md. Given workload characteristics, triages into interactive/semi/batch and estimates savings.
Exercises
- Run
code/main.py. For a 100k-doc pipeline with 3K-token system prompt and 500-token output, compute the savings of full stack (batch + cache) vs sync baseline.
- Pick three features in a real product you know. Triage each into interactive/semi/batch.
- A user complains their report took 3 hours. Was that a batch mis-triage or a legitimate interactive? Write the decision criterion.
- Your batch API return SLA is 24h but P99 is 20 hours. How do you communicate this to the user — what is the downstream system behavior on the edge case?
- Compute break-even: at what shared-prefix length does batch + cache become cheaper than running overnight on your own reserved GPU?
Key Terms
| Term |
What people say |
What it actually means |
| Batch API |
"async discount" |
50% off with 24h turnaround |
| JSONL |
"batch format" |
One JSON request per line; OpenAI/Anthropic standard |
| Message Batches |
"Anthropic batch" |
Anthropic's batch API product name |
| Batch prediction |
"Vertex batch" |
Vertex AI's batch API product |
| Turnaround SLA |
"24h promise" |
Guarantee, not typical; typical is 2-6h |
| Workload triage |
"interactivity decision" |
Interactive / semi / batch routing decision |
| Output schema |
"response format" |
Per-provider JSONL layout; not portable |
| Stacked discount |
"batch + cache" |
~10% of uncached sync bill when both apply |
Further Reading
0 lifetime access. Curriculum based on AI Engineering from Scratch by Rohit Ghumare (MIT, used under attribution).