Phase 14 - Lesson 13

LangGraph: Stateful Graphs and Durable Execution

LangGraph is the 2026 reference for low-level stateful orchestration. Agent is a state machine; nodes are functions; edges are transitions; state is immutable and checkpointed after every step. Resume from any failure exactly where it left off.

Type: Learn + Build Languages: Python (stdlib) Prerequisites: Phase 14 · 01 (Agent Loop), Phase 14 · 12 (Workflow Patterns) Time: ~75 minutes

Learning Objectives

  • Describe LangGraph's core model: state machine with immutable state, function nodes, conditional edges, and post-step checkpoints.
  • Name the four capabilities the docs highlight: durable execution, streaming, human-in-the-loop, comprehensive memory.
  • Explain the three orchestration topologies LangGraph supports: supervisor, peer-to-peer (swarm), hierarchical (nested subgraphs).
  • Implement a stdlib state graph with immutable state, conditional edges, and a checkpoint/resume cycle.

The Problem

Agents and workflows share a problem: when a 40-step run fails at step 38, you want to resume from step 38, not start over. Second-class state models leave operators hacking retries around a library that assumes fresh runs.

LangGraph's design answer: state is a first-class typed object, mutations are explicit, and checkpoints persist after every node. Resume is a load_state(session_id) call.

The Concept

The graph

A graph is defined by:

  • State type. A typed dict (or Pydantic model) that every node reads and mutates.
  • Nodes. Pure functions (state) -> state_update. Updates are merged into state after return.
  • Edges. Conditional or direct transitions between nodes.
  • Entry and exit. START and END sentinel nodes mark the boundary.

Example: an agent with classify, refund, bug, sales, done nodes — a routing workflow as a graph.

Durable execution

After each node returns, the runtime serializes the state and writes it to a checkpointer (SQLite, Postgres, Redis, custom). On failure at step N, the runtime can resume(session_id) and pick up from step N+1 with exact state.

The LangGraph docs explicitly highlight production users where this matters: Klarna, Uber, J.P. Morgan. The claim isn't the graph shape; it's that the graph shape plus checkpointing makes recovery cheap.

Streaming

Every node can yield partial output. The graph streams per-node-delta events to the caller so UIs update as the graph runs.

Human-in-the-loop

Inspect and modify state between nodes. Implementations: pause before a critical node, surface state to a human, accept modifications, resume. The checkpointer makes this easy because state is already serialized.

Memory

Short-term (within a run — conversation history in state) and long-term (across runs — persistent via the checkpointer plus a separate long-term store). LangGraph integrates with external memory systems (Mem0, custom) via tools.

Three topologies

  1. Supervisor. Central router LLM dispatches to specialist subagents. create_supervisor() in langgraph-supervisor (though the LangChain team in 2026 recommends doing this through tool calls directly for more context control).
  2. Swarm / peer-to-peer. Agents hand off directly via a shared tool surface. No central router.
  3. Hierarchical. Supervisors managing sub-supervisors, implemented as nested subgraphs.

Where this pattern goes wrong

  • Checkpoints too small. Only checkpointing conversation turns leaves tool state and memory writes unrecoverable. Full state must serialize.
  • Non-deterministic nodes. Resume assumes node inputs produce the same state update. Random seeds, wall-clock, external APIs must be captured.
  • Over-use of conditional edges. A graph with every edge conditional is a state machine that cannot be reasoned about. Prefer linear chains with occasional branches.

Build It

code/main.py implements a stdlib stateful graph:

  • State — a typed dict with messages, step, route, output, human_approval.
  • Node — callable taking state and returning an update dict.
  • StateGraph — nodes + edges + conditional edges + run + resume.
  • SQLiteCheckpointer (in-memory fake) — serializes state after every node; load(session_id) restores.
  • A demo graph: classify -> branch(refund / bug / sales) -> human gate -> send.

Run it:

python3 code/main.py

The trace shows the first run failing at the human gate, persistence, then resume producing the final output.

Use It

  • LangGraph — the reference, production-ready. Use create_react_agent, create_supervisor, or build your own graph.
  • AutoGen v0.4 (Lesson 14) — actor model alternative for high-concurrency scenarios.
  • Claude Agent SDK (Lesson 17) — managed harness with built-in session store.
  • Custom — when you need exact control over state shape or checkpointer backend.

Ship It

outputs/skill-state-graph.md generates a LangGraph-shaped state graph in any target runtime with checkpointing and resume wired in.

Exercises

  1. Add a conditional edge from classify to end when classification confidence is below a threshold. Resume the run after a human sets route manually.
  2. Swap the SQLite-like fake for a real SQLite checkpointer. Measure per-step serialization overhead.
  3. Implement parallel edges: two nodes run concurrently, merge by a custom reducer. What does immutable state buy here?
  4. Read langgraph-supervisor reference. Port the toy to create_supervisor. Compare the trace shapes.
  5. Add streaming: each node yields partial state while it runs. Print the deltas as they arrive.

Key Terms

Term What people say What it actually means
State graph "Agent as state machine" Typed state + nodes + edges + reducers
Checkpointer "Persistence backend" Serializes state after every node; enables resume
Reducer "State merger" Function that combines current state with a node's update
Conditional edge "Branch" Edge chosen by a function of state
Subgraph "Nested graph" A graph used as a node inside another graph
Durable execution "Resume from failure" Restart at the last successful node with exact state
Supervisor "Router LLM" Central dispatcher for specialist subagents
Swarm "P2P agents" Agents hand off via shared tools; no central router

Further Reading

0 lifetime access. Curriculum based on AI Engineering from Scratch by Rohit Ghumare (MIT, used under attribution).