
Speech AI

Turning speech into text (ASR), text into speech (TTS), and building voice-driven AI applications.

Overview

A voice interface is typically a three-stage pipeline, with ASR and TTS wrapped around an LLM:

flowchart LR
    A[🎙️ Audio in] --> B[ASR<br/>speech → text]
    B --> C[🧠 LLM<br/>understand & respond]
    C --> D[TTS<br/>text → speech]
    D --> E[🔊 Audio out]
  • ASR (Automatic Speech Recognition): transcribes audio to text. Whisper is a widely used open-source model; hosted APIs are often faster and simpler.
  • TTS (Text-to-Speech): synthesizes natural-sounding speech from text.
  • Diarization: works out who spoke when (speaker labels).
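
The flow in the diagram can be sketched as three pluggable functions chained together. This is a minimal illustration, not a production design: the type aliases, `voice_turn`, and the `fake_*` stages are hypothetical names introduced here, and the stand-in stages only move bytes and strings around so the example runs without any model or API key.

```python
from typing import Callable

# Hypothetical stage signatures; real ASR/LLM/TTS clients would be wrapped to match.
Transcriber = Callable[[bytes], str]   # audio in -> text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


def voice_turn(audio: bytes, asr: Transcriber, llm: Responder, tts: Synthesizer) -> bytes:
    """Run one conversational turn: ASR -> LLM -> TTS."""
    user_text = asr(audio)
    reply_text = llm(user_text)
    return tts(reply_text)


# Stand-in stages (assumptions for illustration only).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = voice_turn(b"hello", fake_asr, fake_llm, fake_tts)
print(reply_audio)  # b'You said: hello'
```

Keeping each stage behind a plain function signature makes it easy to swap a local model for a hosted API, or to time each stage separately.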

Learning Objectives

By the end of this section you will be able to:

  • Transcribe audio with an ASR model and handle timestamps.
  • Choose between local (Whisper) and hosted transcription.
  • Build a simple voice-assistant loop (ASR → LLM → TTS).
  • Understand latency budgets for real-time voice.

Quick taste: transcribe audio

transcribe.py
# Using the open-source Whisper model locally (CPU works, GPU is faster).
# Requires `pip install openai-whisper` and ffmpeg available on the PATH.
import whisper

model = whisper.load_model("base")     # tiny/base/small/medium/large
result = model.transcribe("meeting.mp3")
print(result["text"])
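
Besides `"text"`, the dictionary returned by `transcribe` also carries a `"segments"` list whose entries include `start` and `end` times in seconds. The sketch below formats such segments as timestamped lines. The helper names `format_timestamp` and `timestamped_lines` are made up for this example, and `sample` is a hand-written stand-in for a transcription result rather than real model output.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def timestamped_lines(result: dict) -> list:
    """Turn a transcription result into '[start -> end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


# Hand-written stand-in for a transcription result (not real model output).
sample = {
    "text": " Hello everyone. Let's get started.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello everyone."},
        {"start": 1.5, "end": 3.25, "text": " Let's get started."},
    ],
}

for line in timestamped_lines(sample):
    print(line)
# [00:00.000 -> 00:01.500] Hello everyone.
# [00:01.500 -> 00:03.250] Let's get started.
```

With a real model, passing `result` from the snippet above to `timestamped_lines` should produce the same kind of output.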

Best Practices

  • ✅ Match model size to your latency and accuracy needs; bigger isn't always worth it.
  • ✅ Stream audio for real-time UX; batch for offline transcription.
  • ✅ Clean audio (noise reduction, correct sample rate) before ASR.
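
As one way to catch sample-rate problems early, the sketch below inspects a WAV header with Python's standard `wave` module before the file goes to an ASR model. The 16 kHz mono target is an assumption (a common input format for speech models, but check what yours expects), and `audio_issues` is a hypothetical helper name. It only reports problems; resampling and noise reduction would need a separate tool.

```python
import os
import tempfile
import wave

TARGET_RATE_HZ = 16_000   # assumed target; check what your ASR model expects
TARGET_CHANNELS = 1       # assumed mono input


def audio_issues(path: str) -> list:
    """Return a list of problems found in a WAV file's header (empty if none)."""
    issues = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != TARGET_RATE_HZ:
            issues.append(f"sample rate is {rate} Hz, expected {TARGET_RATE_HZ} Hz")
        if channels != TARGET_CHANNELS:
            issues.append(f"{channels} channels, expected {TARGET_CHANNELS}")
        if wav.getnframes() == 0:
            issues.append("file contains no audio frames")
    return issues


# Demo: write 0.1 s of 44.1 kHz stereo silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)                        # 16-bit samples
    wav.setframerate(44_100)
    wav.writeframes(b"\x00" * (4410 * 2 * 2))  # frames * channels * bytes per sample

print(audio_issues(demo_path))
# ['sample rate is 44100 Hz, expected 16000 Hz', '2 channels, expected 1']
```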

Common Mistakes

  • ❌ Ignoring latency: a 3-second round trip feels broken in a voice conversation.
  • ❌ Sending raw, noisy audio and blaming the model for poor transcripts.
  • ❌ Forgetting to handle silence and interruptions in live voice apps.
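
A latency budget makes the first mistake concrete: add up the time each stage takes and compare the total with a target. Every number in this sketch is a placeholder, not a measurement. The stage names, the per-stage latencies, and the 1000 ms target are all assumptions to replace with figures from your own pipeline.

```python
# Hypothetical per-stage latencies in milliseconds; measure your own pipeline.
stage_latency_ms = {
    "asr": 300,
    "llm_first_token": 500,
    "tts_first_audio": 250,
    "network": 150,
}

BUDGET_MS = 1000  # assumed target for time to first audible reply


def check_budget(stages: dict, budget_ms: int) -> tuple:
    """Return (total latency, remaining headroom) in milliseconds."""
    total = sum(stages.values())
    return total, budget_ms - total


total, headroom = check_budget(stage_latency_ms, BUDGET_MS)
slowest = max(stage_latency_ms, key=stage_latency_ms.get)

print(f"total={total} ms, headroom={headroom} ms")  # total=1200 ms, headroom=-200 ms
print(f"slowest stage: {slowest}")                  # slowest stage: llm_first_token
```

Negative headroom means the pipeline is over budget, and the slowest stage is usually the first place to look, for example by streaming partial results instead of waiting for a stage to finish.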

Help build this section

Claim a topic by opening an issue:

  • ✅ Speech-to-Text (ASR): Whisper, timestamps, model sizing 🟡
  • [WANTED] Real-time voice pipelines: streaming, VAD, barge-in 🔴
  • ✅ Text-to-Speech (TTS): local vs hosted, streaming, SSML 🟡
  • [WANTED] Speaker diarization: who said what 🔴
  • [WANTED] Meeting assistant example: record → transcribe → summarize 🟡

References