Speech-to-Text (ASR)
Turning audio into text with automatic speech recognition: the front door of every voice application, from transcription to voice assistants.
Overview
Automatic Speech Recognition (ASR) converts spoken audio into written text. Modern ASR is dominated by transformer models, most famously OpenAI's open-source Whisper, that transcribe many languages robustly, even with accents and background noise. This page shows how to transcribe audio, get timestamps, choose a model size, and avoid the pitfalls that make transcripts worse than they should be.
Learning Objectives
By the end of this page you will be able to:
- Transcribe an audio file locally with Whisper.
- Get word/segment timestamps for captions and search.
- Trade off model size vs. speed vs. accuracy.
- Decide between local and hosted ASR.
Theory
How modern ASR works (briefly)
Audio is first turned into a spectrogram (a picture of frequencies over time), which a transformer encoder consumes; a decoder then generates text tokens. This is the same encoder-decoder idea from The Transformer, applied to sound. Because Whisper is trained on a large multilingual dataset, one model handles many languages and can even translate speech into English.
```mermaid
flowchart LR
    A[Audio] --> S[Spectrogram] --> E[Encoder]
    E --> D[Decoder] --> T[Text + timestamps]
```
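To make "a picture of frequencies over time" concrete, here is a minimal sketch that builds a tiny magnitude spectrogram using only the standard library. It is an illustration under assumptions, not Whisper's front end: Whisper computes a log-Mel spectrogram with optimized routines, while this uses a naive DFT on a synthetic tone. The function names, frame sizes, and the 8 kHz toy sample rate are all hypothetical.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins of one frame."""
    n = len(frame)
    # A Hann window reduces leakage between neighbouring frequency bins.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, frame_len=64, hop=32):
    """One magnitude spectrum per frame: rows are time, columns are frequency."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Toy input: a 1 kHz tone sampled at 8 kHz (illustrative values only).
SAMPLE_RATE = 8000
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)

# DFT bin k corresponds to the frequency k * SAMPLE_RATE / FRAME_LEN.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames; strongest frequency in frame 0:", peak_hz, "Hz")
```

Each row of `spec` is one time slice and each column one frequency band, which is the grid the encoder consumes in place of raw waveform samples.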
Model size: the core trade-off
Whisper, like most ASR systems, ships in several sizes. Bigger models are more accurate but slower and need more memory:
| Size | Relative speed | Use when |
|---|---|---|
| `tiny` / `base` | Fastest | Quick drafts, real-time on CPU, clean audio |
| `small` / `medium` | Balanced | Most production transcription |
| `large` | Slowest, most accurate | Hard audio: accents, noise, domain terms |
[!TIP] Don't default to `large`. Test a smaller model on your audio first; it's often accurate enough at a fraction of the cost and latency.
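One way to test a smaller model on your own audio is to compare each model's transcript against a hand-checked reference using word error rate (WER). The sketch below is a plain word-level edit distance with simplified normalization (lowercasing and whitespace splitting only); the function name and the sample transcripts are hypothetical, not real model output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts standing in for the output of two model sizes.
reference = "today we are talking about speech recognition"
candidates = {
    "tiny": "today we are talking about peach recognition",
    "small": "today we are talking about speech recognition",
}
for size, transcript in candidates.items():
    print(f"{size}: WER = {word_error_rate(reference, transcript):.2f}")
```

Pair the WER with wall-clock time per clip (for example via `time.perf_counter()`) to see whether the accuracy gain from a larger model justifies its latency.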
Local vs. hosted
| | Local (Whisper) | Hosted API |
|---|---|---|
| Cost | No per-minute fee; you supply the hardware | Billed per minute of audio |
| Privacy | Audio never leaves your machine | Audio is sent to the provider |
| Setup | Install + model download | One API call |
| Best for | Privacy-sensitive, high volume | Fast start, no infra |
Practical Example
Transcribe locally with Whisper
```python
import whisper  # pip install -U openai-whisper (requires ffmpeg on your PATH)

model = whisper.load_model("base")  # tiny / base / small / medium / large
result = model.transcribe("meeting.mp3")
print(result["text"])  # full transcript
```
Get timestamped segments (for captions or search)
```python
# Reuses the model loaded above; each segment carries start/end times in seconds.
result = model.transcribe("interview.wav")
for seg in result["segments"]:
    start, end, text = seg["start"], seg["end"], seg["text"]
    print(f"[{start:6.1f}s -> {end:6.1f}s] {text.strip()}")
# Example output:
# [   0.0s ->    4.2s] Welcome to the show.
# [   4.2s ->    9.8s] Today we're talking about speech recognition.
```
Timestamps let you build subtitles (SRT/VTT), jump-to-quote search, and align a transcript with the audio for a meeting assistant.
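As an illustration of jump-to-quote search, the sketch below scans a list of segments for a phrase and returns where it was spoken. The `find_quote` helper and the sample segments are hypothetical; the segments only mimic the `start` / `end` / `text` keys used in the loop above.

```python
def find_quote(segments, phrase):
    """Return (start, end, text) for every segment whose text contains phrase."""
    needle = phrase.casefold()
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in segments
        if needle in seg["text"].casefold()
    ]


# Hypothetical segments in the same shape as result["segments"].
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the show."},
    {"start": 4.2, "end": 9.8, "text": " Today we're talking about speech recognition."},
]

for start, end, text in find_quote(segments, "speech recognition"):
    print(f"Jump to {start:.1f}s: {text}")
```

This simple version only matches phrases that fall inside a single segment; a quote that straddles two segments would need the neighbouring texts joined before searching.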
Clean audio beats a bigger model
A correct sample rate, a mono channel, and light noise reduction often improve accuracy more than jumping up a model size. Garbage audio in, garbage transcript out.
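A minimal pre-processing sketch, assuming ffmpeg is installed and on your PATH: it builds an ffmpeg command that downmixes to mono and resamples to 16 kHz. The 16 kHz target is an assumption (it is the rate Whisper works at internally), and the helper names and file names are hypothetical. Noise reduction is deliberately left out.

```python
import subprocess


def ffmpeg_mono_16k_command(src: str, dst: str) -> list[str]:
    """Build the ffmpeg argument list for a mono, 16 kHz copy of src."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file
        "-ac", "1",      # downmix to a single (mono) channel
        "-ar", "16000",  # resample to 16 kHz
        dst,
    ]


def convert_to_mono_16k(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_mono_16k_command(src, dst), check=True)


# Show the command that would run, without touching any files.
print(" ".join(ffmpeg_mono_16k_command("meeting.mp3", "meeting_16k.wav")))
```

Call `convert_to_mono_16k("meeting.mp3", "meeting_16k.wav")` to perform the conversion, then transcribe the new file and compare it against the original.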
Best Practices
- Match model size to your accuracy/latency budget, and measure on real audio.
- Pre-process: correct sample rate, convert to mono, reduce obvious noise.
- Use timestamps for captions and to align text with audio.
- For real-time, stream audio in chunks and use a small model.
- Post-process domain terms (names, jargon) with a correction dictionary if needed.
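The correction-dictionary idea from the last bullet can be sketched as a case-insensitive, whole-word search and replace. The `apply_corrections` helper and the mis-hearings in `CORRECTIONS` are hypothetical examples; in practice you would collect the entries from errors you actually see in your transcripts.

```python
import re


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace known mis-transcriptions with the intended domain terms."""
    if not corrections:
        return text
    # Longest keys first so multi-word entries win over shorter overlaps.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in corrections.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Hypothetical mis-hearings mapped to the terms the speaker meant.
CORRECTIONS = {
    "cooper netties": "Kubernetes",
    "pie torch": "PyTorch",
}

print(apply_corrections("We deploy Pie Torch models on cooper netties.", CORRECTIONS))
```

The word-boundary anchors keep the replacement from firing inside longer words, which matters once the dictionary contains short entries.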
Common Mistakes
- Reaching for `large` when `small` would do: wasted time and compute.
- Feeding noisy, wrong-sample-rate audio and blaming the model.
- Ignoring latency in live voice apps: a long transcription lag feels broken.
- Assuming perfect accuracy: always allow for transcription errors downstream.
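To keep lag low in a live app, audio is usually transcribed in short windows rather than as one long file. The sketch below shows only the chunking step, under assumed values: 5-second windows with 1 second of overlap so that a word cut at one boundary appears whole in the next window. The `chunk_samples` helper is hypothetical, and merging the duplicated text from overlapping windows is not shown.

```python
def chunk_samples(samples, sample_rate, chunk_seconds=5.0, overlap_seconds=1.0):
    """Yield (start_time_seconds, chunk) windows over a buffer of samples."""
    chunk_len = int(chunk_seconds * sample_rate)
    step = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk must be longer than the overlap")
    for start in range(0, len(samples), step):
        yield start / sample_rate, samples[start:start + chunk_len]
        if start + chunk_len >= len(samples):
            break  # this window already reached the end of the buffer


# Twelve seconds of silence at 16 kHz stands in for a microphone buffer.
SAMPLE_RATE = 16000
audio = [0.0] * (12 * SAMPLE_RATE)

for start_time, chunk in chunk_samples(audio, SAMPLE_RATE):
    duration = len(chunk) / SAMPLE_RATE
    print(f"chunk at {start_time:.1f}s, {duration:.1f}s long")
```

Shorter windows reduce lag but give the model less context per call, so the window length is worth tuning against your own latency budget.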
Exercises
- Transcribe the same clip with `tiny`, `base`, and `small`. Compare accuracy and time. Which is "good enough" for your use case?
- Generate an SRT subtitle file from the segment timestamps.
- Feed a noisy recording, then a cleaned version (mono, denoised). Measure the difference.