Phase 12

Multimodal AI

25 hands-on lessons that build AI from first principles in the browser. Reading is free; graded exercises and the certificate are included with lifetime access.

Vision Transformers and the Patch-Token Primitive (graded)
CLIP and Contrastive Vision-Language Pretraining (graded)
From CLIP to BLIP-2 — Q-Former as Modality Bridge (graded)
Flamingo and Gated Cross-Attention for Few-Shot VLMs (graded)
LLaVA and Visual Instruction Tuning (graded)
Any-Resolution Vision: Patch-n'-Pack and NaFlex (graded)
Open-Weight VLM Recipes: What Actually Matters
LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model
Qwen-VL Family and Dynamic-FPS Video
InternVL3: Native Multimodal Pretraining
Chameleon and Early-Fusion Token-Only Multimodal Models (graded)
Emu3: Next-Token Prediction for Image and Video Generation
Transfusion: Autoregressive Text + Diffusion Image in One Transformer (graded)
Show-o and Discrete-Diffusion Unified Models (graded)
Janus-Pro: Decoupled Encoders for Unified Multimodal Models
MIO and Any-to-Any Streaming Multimodal Models
Video-Language Models: Temporal Tokens and Grounding
Long-Video Understanding at Million-Token Context
Audio-Language Models: the Whisper to Audio Flamingo 3 Arc
Omni Models: Qwen2.5-Omni and the Thinker-Talker Split
Embodied VLAs: RT-2, OpenVLA, π0, GR00T
Document and Diagram Understanding
ColPali and Vision-Native Document RAG (graded)
Multimodal RAG and Cross-Modal Retrieval (graded)
Multimodal Agents and Computer-Use (Capstone)