Phase 06

Speech and Audio

17 hands-on lessons that build AI from first principles, in the browser. Reading is free; graded exercises and the certificate come with lifetime access.

Audio Fundamentals — Waveforms, Sampling, Fourier Transform (graded)
Spectrograms, Mel Scale & Audio Features (graded)
Audio Classification — From k-NN on MFCCs to AST and BEATs (graded)
Speech Recognition (ASR) — CTC, RNN-T, Attention (graded)
Whisper — Architecture & Fine-Tuning
Speaker Recognition & Verification (graded)
Text-to-Speech (TTS) — From Tacotron to F5 and Kokoro
Voice Cloning & Voice Conversion
Music Generation — MusicGen, Stable Audio, Suno, and the Licensing Earthquake
Audio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio
Real-Time Audio Processing (graded)
Build a Voice Assistant Pipeline — The Phase 6 Capstone (graded)
Neural Audio Codecs — EnCodec, SNAC, Mimi, DAC and the Semantic-Acoustic Split (graded)
Voice Activity Detection & Turn-Taking — Silero, Cobra, and the Flush Trick (graded)
Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue
Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify (graded)
Audio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open Leaderboards (graded)