Phase 06

Speech and Audio

17 hands-on lessons building AI from first principles in the browser. Free reading; graded exercises and certificate with lifetime access.

  1. Audio Fundamentals — Waveforms, Sampling, Fourier Transform (graded)
  2. Spectrograms, Mel Scale & Audio Features (graded)
  3. Audio Classification — From k-NN on MFCCs to AST and BEATs (graded)
  4. Speech Recognition (ASR) — CTC, RNN-T, Attention (graded)
  5. Whisper — Architecture & Fine-Tuning
  6. Speaker Recognition & Verification (graded)
  7. Text-to-Speech (TTS) — From Tacotron to F5 and Kokoro
  8. Voice Cloning & Voice Conversion
  9. Music Generation — MusicGen, Stable Audio, Suno, and the Licensing Earthquake
  10. Audio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio
  11. Real-Time Audio Processing (graded)
  12. Build a Voice Assistant Pipeline — The Phase 6 Capstone (graded)
  13. Neural Audio Codecs — EnCodec, SNAC, Mimi, DAC and the Semantic-Acoustic Split (graded)
  14. Voice Activity Detection & Turn-Taking — Silero, Cobra, and the Flush Trick (graded)
  15. Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue
  16. Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify (graded)
  17. Audio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open Leaderboards (graded)
Curriculum based on AI Engineering from Scratch by Rohit Ghumare (MIT, used under attribution).