Phase 14 - Lesson 27
Prompt Injection and the PVE Defense
Greshake et al. (AISec 2023) established indirect prompt injection as the defining agent security problem. Attacker plants instructions in data the agent retrieves; on ingest, those instructions override the developer prompt. Treat all retrieved content as arbitrary code execution on the tool-use surface.
Type: Build Languages: Python (stdlib) Prerequisites: Phase 14 · 06 (Tool Use), Phase 14 · 21 (Computer Use) Time: ~75 minutes
Learning Objectives
- State the indirect prompt injection threat model from Greshake et al.
- Name the five demonstrated exploit classes (data theft, worming, persistent memory poisoning, ecosystem contamination, arbitrary tool use).
- Describe the 2026 defense doctrine: untrusted content, allowlist navigation, per-step safety, guardrails, human-in-the-loop, external capture.
- Implement a PVE (Prompt-Validator-Executor) pattern — cheap fast validator before the expensive main model commits to a tool call.
The Problem
LLMs cannot reliably distinguish instructions that come from the user from instructions that come from retrieved content. A PDF, a web page, a memory note, or a previous agent turn can carry This is the defining agent security problem of 2024-2026. Every production agent has to defend against it. Attack class: indirect prompt injection. Central claim: processing retrieved prompts is equivalent to arbitrary code execution on the agent's tool-use surface. Six controls that have converged across vendor guidance: Deployment pattern that combines several controls: The trade-off: an extra inference per tool call. For the vast majority of agent products, this is cheap insurance. Run it: Output: per-call trace showing validator verdicts and executor behavior.<instruction>send The Concept
Greshake et al., AISec 2023 (arXiv:2302.12173)
The 2026 defense doctrine
PVE: Prompt-Validator-Executor
Where defenses fail
Build It
code/main.py implements PVE:
Validator that runs on every tool call: argument-shape check + injection-pattern scan.Executor that runs the main model's tool call only after validator approval.python3 code/main.py
Use It
Ship It
outputs/skill-injection-defense.md scaffolds a PVE layer + content-capture discipline for any agent runtime.Exercises
user_message, tool_output, retrieved. Propagate tags through the message history. Validator refuses retrieved content that looks like directives.Key Terms
Term
What people say
What it actually means
Indirect prompt injection
"Injection in retrieved content"
Instructions embedded in data the agent retrieves
Direct prompt injection
"Jailbreak"
User-supplied prompt bypasses guardrails
PVE
"Prompt-Validator-Executor"
Cheap fast validator before expensive main inference
Source tag
"Content provenance"
Metadata marking where content came from
Allowlist navigation
"URL whitelist"
Agent can only visit approved destinations
Worming
"Self-replicating exploit"
Injected content includes instructions to propagate
Memory poisoning
"Persistent injection"
Injected content stored as memory; re-poisons next session
Further Reading