Phase 12

Multimodal AI

25 hands-on lessons that build AI from first principles in the browser. Reading is free; graded exercises and a certificate come with lifetime access.

  1. Vision Transformers and the Patch-Token Primitive (graded)
  2. CLIP and Contrastive Vision-Language Pretraining (graded)
  3. From CLIP to BLIP-2 — Q-Former as Modality Bridge (graded)
  4. Flamingo and Gated Cross-Attention for Few-Shot VLMs (graded)
  5. LLaVA and Visual Instruction Tuning (graded)
  6. Any-Resolution Vision: Patch-n'-Pack and NaFlex (graded)
  7. Open-Weight VLM Recipes: What Actually Matters
  8. LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
  9. Qwen-VL Family and Dynamic-FPS Video
  10. InternVL3: Native Multimodal Pretraining
  11. Chameleon and Early-Fusion Token-Only Multimodal Models (graded)
  12. Emu3: Next-Token Prediction for Image and Video Generation
  13. Transfusion: Autoregressive Text + Diffusion Image in One Transformer (graded)
  14. Show-o and Discrete-Diffusion Unified Models (graded)
  15. Janus-Pro: Decoupled Encoders for Unified Multimodal Models
  16. MIO and Any-to-Any Streaming Multimodal Models
  17. Video-Language Models: Temporal Tokens and Grounding
  18. Long-Video Understanding at Million-Token Context
  19. Audio-Language Models: The Whisper to Audio Flamingo 3 Arc
  20. Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
  21. Embodied VLAs: RT-2, OpenVLA, π0, GR00T
  22. Document and Diagram Understanding
  23. ColPali and Vision-Native Document RAG (graded)
  24. Multimodal RAG and Cross-Modal Retrieval (graded)
  25. Multimodal Agents and Computer-Use (Capstone)
Curriculum based on AI Engineering from Scratch by Rohit Ghumare (MIT, used under attribution).