Phase 17 - Lesson 15

Batch APIs — the 50% Discount as Industry Standard

This lesson includes a graded coding exercise that runs in your browser, unlocked with lifetime access.

Every major provider ships an async batch API with a 50% discount and ~24-hour turnaround. OpenAI, Anthropic, Google, and most of the inference platforms (Fireworks batch tier, Together batch) implement the same pattern. Stack batch with prompt caching and overnight pipelines drop to ~10% of synchronous-uncached cost. The rule is brutally simple: if it is not interactive, it belongs on batch. Content generation pipelines, document classification, data extraction, report generation, bulk labeling, catalog tagging — anything tolerant of 24-hour latency is money left on the table until it moves to batch. The 2026 production pattern is to triage every new LLM workload into three lanes: interactive (synchronous with caching), semi-interactive (async queue with fallback), batch (overnight, cached input stacked). Workloads that pretend to be interactive but tolerate minutes of latency waste most.

Type: Learn Languages: Python (stdlib, toy batch-vs-sync cost simulator) Prerequisites: Phase 17 · 14 (Prompt & Semantic Caching) Time: ~45 minutes

Learning Objectives

Name the three provider batch APIs (OpenAI, Anthropic, Google) and the common 50% discount + 24h turnaround guarantees.
Compute the cost for stacking batch + cached-input on an overnight classification workload and compare to synchronous-uncached baseline.
Triage a workload into interactive / semi-interactive / batch and justify the lane.
Name the two traps: partial interactivity (user expects faster than 24h) and output-schema drift (batch file format differs per provider).

The Problem

Your team ships a nightly report generation pipeline. 50,000 documents, summarize each, cluster the summaries, draft an executive brief. Running synchronously it takes 4 hours at

Term	What people say	What it actually means
Batch API	"async discount"	50% off with 24h turnaround
JSONL	"batch format"	One JSON request per line; OpenAI/Anthropic standard
Message Batches	"Anthropic batch"	Anthropic's batch API product name
Batch prediction	"Vertex batch"	Vertex AI's batch API product
Turnaround SLA	"24h promise"	Guarantee, not typical; typical is 2-6h
Workload triage	"interactivity decision"	Interactive / semi / batch routing decision
Output schema	"response format"	Per-provider JSONL layout; not portable
Stacked discount	"batch + cache"	~10% of uncached sync bill when both apply

Batch APIs — the 50% Discount as Industry Standard

Learning Objectives

The Problem

The Concept

The three batch APIs

Semantic: asynchronous, not slow

Stack with caching

Workload triage

The partial-interactivity trap

The output-schema trap

Numbers you should remember

Use It

Ship It

Exercises

Key Terms

Further Reading