Speech-to-Text (ASR)

Turning audio into text with automatic speech recognition: the front door of every voice application, from transcription to voice assistants.

Overview

Automatic Speech Recognition (ASR) converts spoken audio into written text. Modern ASR is dominated by transformer models, most famously OpenAI's open-source Whisper, that transcribe many languages robustly, even with accents and background noise. This page shows how to transcribe audio, get timestamps, choose a model size, and avoid the pitfalls that make transcripts worse than they should be.

Learning Objectives

By the end of this page you will be able to:

  • Transcribe an audio file locally with Whisper.
  • Get word/segment timestamps for captions and search.
  • Trade off model size vs. speed vs. accuracy.
  • Decide between local and hosted ASR.

Theory

How modern ASR works (briefly)

Audio is first turned into a spectrogram (a picture of frequencies over time), which a transformer encoder consumes; a decoder then generates text tokens. This is the same encoder-decoder idea from The Transformer, applied to sound. Because the model is trained on a very large multilingual dataset, one model handles many languages and can even translate speech to English.

flowchart LR
    A[🎙️ Audio] --> S[Spectrogram] --> E[Encoder]
    E --> D[Decoder] --> T[Text + timestamps]
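
To make the spectrogram step concrete, here is a minimal sketch of a magnitude spectrogram built with a naive short-time Fourier transform in pure Python. It is an illustration only: Whisper itself works on a log-Mel spectrogram computed with optimized routines, and the function name `spectrogram`, the frame size, and the hop size below are assumptions chosen for readability, not values taken from any library.

```python
import cmath
import math


def spectrogram(samples, frame_size=256, hop=128):
    """Return a list of frames; each frame lists the magnitude of every frequency bin."""
    # A Hann window tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    n_bins = frame_size // 2 + 1  # bins from 0 Hz up to the Nyquist frequency
    # Precompute the DFT factors e^(-2*pi*i*k*n/N) once (naive O(N^2) transform).
    factors = [[cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)]
               for k in range(n_bins)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        frames.append([abs(sum(x * f for x, f in zip(frame, row)))
                       for row in factors])
    return frames


# Synthetic input: a quarter second of a 440 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(2000)]
spec = spectrogram(tone)

# The strongest bin of the first frame should sit next to 440 Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 256
```

Each row of `spec` is one time step and each column one frequency band, which is the "picture" the encoder consumes. Real pipelines use an FFT and a Mel filter bank instead of this naive transform.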

Model size: the core trade-off

Whisper, like most ASR systems, ships in several sizes. Bigger models are more accurate but slower and need more memory:

| Size | Relative speed | Use when |
| --- | --- | --- |
| tiny / base | Fastest | Quick drafts, real-time on CPU, clean audio |
| small / medium | Balanced | Most production transcription |
| large | Slowest, most accurate | Hard audio: accents, noise, domain terms |

[!TIP] Don't default to large. Test a smaller model on your audio first; it's often accurate enough at a fraction of the cost and latency.
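
One way to test a model on your own audio is to time each run and score it against a hand-checked reference transcript. The sketch below computes word error rate (WER, word-level edit distance divided by the number of reference words). The helper names `word_error_rate` and `benchmark` are hypothetical, and the punctuation stripping is a deliberately crude normalization, not a standard scoring procedure.

```python
import string
import time

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def _words(text: str) -> list:
    """Lowercase, strip punctuation, and split into words."""
    return text.lower().translate(_PUNCTUATION).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def benchmark(transcribe, audio_path: str, reference: str) -> dict:
    """Time one call of `transcribe` (any callable: path -> text) and score it."""
    started = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - started
    return {"seconds": elapsed, "wer": word_error_rate(reference, text)}
```

With Whisper, load the model outside the timed call and pass something like `lambda path: model.transcribe(path)["text"]` as `transcribe`, then compare the results across sizes on the same clip.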

Local vs. hosted

| | Local (Whisper) | Hosted API |
| --- | --- | --- |
| Cost | No per-minute fee (you supply the hardware) | Billed per minute of audio |
| Privacy | Audio never leaves your machine | Audio sent to provider |
| Setup | Install + model download | One API call |
| Best for | Privacy-sensitive, high volume | Fast start, no infra |

Practical Example

Transcribe locally with Whisper

transcribe.py
import whisper  # pip install -U openai-whisper  (needs ffmpeg installed)

model = whisper.load_model("base")          # tiny/base/small/medium/large
result = model.transcribe("meeting.mp3")

print(result["text"])                        # full transcript

segments.py
import whisper

model = whisper.load_model("base")
result = model.transcribe("interview.wav")
for seg in result["segments"]:                # one dict per transcribed segment
    start, end, text = seg["start"], seg["end"], seg["text"]
    print(f"[{start:6.1f}s → {end:6.1f}s] {text.strip()}")
# [   0.0s →    4.2s] Welcome to the show.
# [   4.2s →    9.8s] Today we're talking about speech recognition.

Timestamps let you build subtitles (SRT/VTT), jump-to-quote search, and align a transcript with the audio for a meeting assistant.
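
As a small illustration of jump-to-quote search, the sketch below scans segment dictionaries shaped like Whisper's `result["segments"]` (keys `start`, `end`, `text`) and returns where a phrase was spoken. The function name `find_quote` and the sample data are made up for this example.

```python
def find_quote(segments, phrase):
    """Return (start_seconds, text) for every segment whose text contains the phrase."""
    needle = phrase.casefold()  # case-insensitive comparison
    return [
        (seg["start"], seg["text"].strip())
        for seg in segments
        if needle in seg["text"].casefold()
    ]


# Sample data in the same shape as result["segments"].
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the show."},
    {"start": 4.2, "end": 9.8, "text": " Today we're talking about speech recognition."},
]

hits = find_quote(segments, "speech recognition")
# hits == [(4.2, "Today we're talking about speech recognition.")]
```

A player can then seek to the returned start time. This simple version misses phrases that straddle two segments; joining neighbouring segments before searching is one way around that.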

Clean audio beats a bigger model

Correct sample rate, mono channel, and light noise reduction improve accuracy more than jumping up a model size. Garbage audio in → garbage transcript out.
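
The mono step can be sketched with the standard library alone. The function below averages the channels of a 16-bit PCM WAV file into a single channel and reports the sample rate so you can check it. The name `downmix_to_mono` is hypothetical, and the sketch assumes uncompressed 16-bit input; other formats need a real audio tool.

```python
import array
import sys
import wave


def downmix_to_mono(src_path: str, dst_path: str) -> int:
    """Average all channels of a 16-bit PCM WAV into one; return the sample rate."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = src.getnchannels()
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved (L, R, L, R, ...): average each group of `channels`.
    mono = array.array("h", (
        sum(samples[i:i + channels]) // channels
        for i in range(0, len(samples), channels)
    ))
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
    return rate
```

Resampling and noise reduction are better left to a dedicated tool. With ffmpeg, for example, `ffmpeg -i in.wav -ac 1 -ar 16000 out.wav` produces 16 kHz mono, the format Whisper models work on.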

Best Practices

  • ✅ Match model size to your accuracy/latency budget, and measure on real audio.
  • ✅ Pre-process: correct sample rate, convert to mono, reduce obvious noise.
  • ✅ Use timestamps for captions and to align text with audio.
  • ✅ For real-time, stream audio in chunks and use a small model.
  • ✅ Post-process domain terms (names, jargon) with a correction dictionary if needed.
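
A correction dictionary can be as simple as a mapping from known mis-transcriptions to the right term, applied after transcription. The sketch below is one possible approach; `apply_corrections` and the dictionary entries are hypothetical examples, not part of Whisper.

```python
import re


def apply_corrections(text: str, corrections: dict) -> str:
    """Replace known mis-transcriptions with the correct term (whole words, any case)."""
    if not corrections:
        return text
    # Try longer phrases first so a long entry wins over a shorter one it contains.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {key.lower(): value for key, value in corrections.items()}
    return pattern.sub(lambda match: lookup[match.group(1).lower()], text)


# Hypothetical entries: what the model tends to write -> what it should be.
CORRECTIONS = {"py torch": "PyTorch", "cooper netties": "Kubernetes"}

fixed = apply_corrections("We deploy Py Torch models on cooper netties.", CORRECTIONS)
# fixed == "We deploy PyTorch models on Kubernetes."
```

Build the dictionary from errors you actually observe in your transcripts, and keep entries specific enough that they cannot match ordinary words.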

Common Mistakes

  • ❌ Reaching for large when small would do: wasted time and compute.
  • ❌ Feeding noisy, wrong-sample-rate audio and blaming the model.
  • ❌ Ignoring latency in live voice apps: a long transcription lag feels broken.
  • ❌ Assuming perfect accuracy: always allow for transcription errors downstream.
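
For live use, latency depends heavily on how much audio you hand the model at once. A rough sketch of chunking, assuming the audio is already a sequence of samples: cut it into short windows that overlap slightly, so a word on a boundary appears whole in at least one chunk. The function name, the 5-second window, and the 0.5-second overlap are illustrative assumptions, not recommended settings.

```python
def chunk_samples(samples, sample_rate, chunk_seconds=5.0, overlap_seconds=0.5):
    """Yield (start_time_seconds, chunk) windows that overlap slightly."""
    size = int(chunk_seconds * sample_rate)
    step = size - int(overlap_seconds * sample_rate)
    if size <= 0 or step <= 0:
        raise ValueError("chunk must be longer than the overlap")
    for start in range(0, len(samples), step):
        yield start / sample_rate, samples[start:start + size]
        if start + size >= len(samples):
            break  # this window already reached the end of the audio


# Tiny demonstration: 2.5 seconds of fake audio at 10 samples per second.
demo = list(range(25))
chunks = list(chunk_samples(demo, sample_rate=10, chunk_seconds=1.0, overlap_seconds=0.2))
starts = [start for start, _ in chunks]
# starts == [0.0, 0.8, 1.6]
```

Each chunk can be transcribed as soon as it is complete, and the start time lets you shift segment timestamps back onto the full recording. Text from the overlapping region appears in two chunks and needs de-duplicating when you stitch the results together.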

Exercises

  1. Transcribe the same clip with tiny, base, and small. Compare accuracy and time. Which is "good enough" for your use case?
  2. Generate an SRT subtitle file from the segment timestamps.
  3. Feed a noisy recording, then a cleaned version (mono, denoised). Measure the difference.

References