Phase 18

Ethics, Safety and Alignment

30 hands-on lessons building AI from first principles in the browser. Reading is free; graded exercises and a certificate are included with lifetime access.

Instruction-Following as Alignment Signal
Reward Hacking and Goodhart's Law
The Direct Preference Optimization Family
Sycophancy as RLHF Amplification
Constitutional AI and RLAIF
Mesa-Optimization and Deceptive Alignment
Sleeper Agents — Persistent Deception
In-Context Scheming in Frontier Models
Alignment Faking
AI Control — Safety Despite Subversion
Scalable Oversight and Weak-to-Strong Generalization
Red-Teaming: PAIR and Automated Attacks
Many-Shot Jailbreaking
ASCII Art and Visual Jailbreaks
Indirect Prompt Injection — Production Attack Surface
Red-Team Tooling — Garak, Llama Guard, PyRIT
WMDP and Dual-Use Capability Evaluation
Frontier Safety Frameworks — RSP, PF, FSF
Anthropic's Model Welfare Program
Bias and Representational Harm in LLMs
Fairness Criteria — Group, Individual, Counterfactual (graded)
Differential Privacy for LLMs (graded)
Watermarking — SynthID, Stable Signature, C2PA
Regulatory Frameworks — EU, US, UK, Korea
EchoLeak and the Emergence of CVEs for AI
Model, System, and Dataset Cards
Data Provenance and Training-Data Governance
Alignment Research Ecosystem — MATS, Redwood, Apollo, METR
Moderation Systems — OpenAI, Perspective, Llama Guard
Dual-Use Risk — Cyber, Bio, Chem, Nuclear Uplift