Phase 12
Multimodal AI
25 hands-on lessons building AI from first principles in the browser. Reading is free; graded exercises and a certificate come with lifetime access.
- Vision Transformers and the Patch-Token Primitive (graded)
- CLIP and Contrastive Vision-Language Pretraining (graded)
- From CLIP to BLIP-2 — Q-Former as Modality Bridge (graded)
- Flamingo and Gated Cross-Attention for Few-Shot VLMs (graded)
- LLaVA and Visual Instruction Tuning (graded)
- Any-Resolution Vision: Patch-n'-Pack and NaFlex (graded)
- Open-Weight VLM Recipes: What Actually Matters
- LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
- Qwen-VL Family and Dynamic-FPS Video
- InternVL3: Native Multimodal Pretraining
- Chameleon and Early-Fusion Token-Only Multimodal Models (graded)
- Emu3: Next-Token Prediction for Image and Video Generation
- Transfusion: Autoregressive Text + Diffusion Image in One Transformer (graded)
- Show-o and Discrete-Diffusion Unified Models (graded)
- Janus-Pro: Decoupled Encoders for Unified Multimodal Models
- MIO and Any-to-Any Streaming Multimodal Models
- Video-Language Models: Temporal Tokens and Grounding
- Long-Video Understanding at Million-Token Context
- Audio-Language Models: The Whisper to Audio Flamingo 3 Arc
- Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
- Embodied VLAs: RT-2, OpenVLA, π0, GR00T
- Document and Diagram Understanding
- ColPali and Vision-Native Document RAG (graded)
- Multimodal RAG and Cross-Modal Retrieval (graded)
- Multimodal Agents and Computer-Use (Capstone)