Phase 18
Ethics, Safety and Alignment
30 hands-on lessons building AI from first principles in the browser. Free reading; graded exercises and certificate with lifetime access.
- Instruction-Following as Alignment Signal
- Reward Hacking and Goodhart's Law
- The Direct Preference Optimization Family
- Sycophancy as RLHF Amplification
- Constitutional AI and RLAIF
- Mesa-Optimization and Deceptive Alignment
- Sleeper Agents — Persistent Deception
- In-Context Scheming in Frontier Models
- Alignment Faking
- AI Control — Safety Despite Subversion
- Scalable Oversight and Weak-to-Strong Generalization
- Red-Teaming: PAIR and Automated Attacks
- Many-Shot Jailbreaking
- ASCII Art and Visual Jailbreaks
- Indirect Prompt Injection — Production Attack Surface
- Red-Team Tooling — Garak, Llama Guard, PyRIT
- WMDP and Dual-Use Capability Evaluation
- Frontier Safety Frameworks — RSP, PF, FSF
- Anthropic's Model Welfare Program
- Bias and Representational Harm in LLMs
- Fairness Criteria — Group, Individual, Counterfactual (graded)
- Differential Privacy for LLMs (graded)
- Watermarking — SynthID, Stable Signature, C2PA
- Regulatory Frameworks — EU, US, UK, Korea
- EchoLeak and the Emergence of CVEs for AI
- Model, System, and Dataset Cards
- Data Provenance and Training-Data Governance
- Alignment Research Ecosystem — MATS, Redwood, Apollo, METR
- Moderation Systems — OpenAI, Perspective, Llama Guard
- Dual-Use Risk — Cyber, Bio, Chem, Nuclear Uplift