Phase 06
Speech and Audio
17 hands-on lessons that build AI from first principles in the browser. Reading is free; graded exercises and a certificate are included with lifetime access.
- Audio Fundamentals — Waveforms, Sampling, Fourier Transform (graded)
- Spectrograms, Mel Scale & Audio Features (graded)
- Audio Classification — From k-NN on MFCCs to AST and BEATs (graded)
- Speech Recognition (ASR) — CTC, RNN-T, Attention (graded)
- Whisper — Architecture & Fine-Tuning
- Speaker Recognition & Verification (graded)
- Text-to-Speech (TTS) — From Tacotron to F5 and Kokoro
- Voice Cloning & Voice Conversion
- Music Generation — MusicGen, Stable Audio, Suno, and the Licensing Earthquake
- Audio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio
- Real-Time Audio Processing (graded)
- Build a Voice Assistant Pipeline — The Phase 6 Capstone (graded)
- Neural Audio Codecs — EnCodec, SNAC, Mimi, DAC and the Semantic-Acoustic Split (graded)
- Voice Activity Detection & Turn-Taking — Silero, Cobra, and the Flush Trick (graded)
- Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue
- Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify (graded)
- Audio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open Leaderboards (graded)