Phase 18

Ethics, Safety and Alignment

30 hands-on lessons that build AI from first principles in the browser. Reading is free; graded exercises and the certificate come with lifetime access.

  1. Instruction-Following as Alignment Signal
  2. Reward Hacking and Goodhart's Law
  3. The Direct Preference Optimization Family
  4. Sycophancy as RLHF Amplification
  5. Constitutional AI and RLAIF
  6. Mesa-Optimization and Deceptive Alignment
  7. Sleeper Agents — Persistent Deception
  8. In-Context Scheming in Frontier Models
  9. Alignment Faking
  10. AI Control — Safety Despite Subversion
  11. Scalable Oversight and Weak-to-Strong Generalization
  12. Red-Teaming: PAIR and Automated Attacks
  13. Many-Shot Jailbreaking
  14. ASCII Art and Visual Jailbreaks
  15. Indirect Prompt Injection — Production Attack Surface
  16. Red-Team Tooling — Garak, Llama Guard, PyRIT
  17. WMDP and Dual-Use Capability Evaluation
  18. Frontier Safety Frameworks — RSP, PF, FSF
  19. Anthropic's Model Welfare Program
  20. Bias and Representational Harm in LLMs
  21. Fairness Criteria — Group, Individual, Counterfactual (graded)
  22. Differential Privacy for LLMs (graded)
  23. Watermarking — SynthID, Stable Signature, C2PA
  24. Regulatory Frameworks — EU, US, UK, Korea
  25. EchoLeak and the Emergence of CVEs for AI
  26. Model, System, and Dataset Cards
  27. Data Provenance and Training-Data Governance
  28. Alignment Research Ecosystem — MATS, Redwood, Apollo, METR
  29. Moderation Systems — OpenAI, Perspective, Llama Guard
  30. Dual-Use Risk — Cyber, Bio, Chem, Nuclear Uplift
Curriculum based on AI Engineering from Scratch by Rohit Ghumare (MIT, used under attribution).