Deployment

Taking an AI feature from your laptop to a reliable, scalable service: packaging, serving, and scaling.

Overview

Deploying an LLM app is mostly normal backend engineering plus a few AI-specific concerns: streaming responses, long request times, token-based cost, and (if you self-host models) GPU serving.

flowchart LR
    C[Client] --> LB[Load balancer]
    LB --> API[FastAPI service]
    API --> P{Model?}
    P -->|Hosted API| Prov[LLM Provider]
    P -->|Self-hosted| GPU[vLLM / TGI on GPU]
    API --> Cache[(Response / prompt cache)]
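The cache branch in the diagram can be sketched as an exact-match lookup in front of the model call. This is a minimal in-memory illustration, not a production cache: `ResponseCache`, `get_or_call`, and the `call_model` callback are hypothetical names, and a real deployment would more likely use a shared store with expiry. Exact-match caching is also only safe when identical inputs should give identical outputs (for example, deterministic sampling settings).

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """In-memory exact-match cache keyed on model, prompt, and parameters."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        # Canonical JSON (sorted keys) so the same request always hashes
        # to the same key, regardless of dict insertion order.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self,
        model: str,
        prompt: str,
        params: dict,
        call_model: Callable[[str, str, dict], str],
    ) -> str:
        """Return a cached response, or call the model once and store the result."""
        k = self.key(model, prompt, params)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = call_model(model, prompt, params)
        self._store[k] = result
        return result
```

Here `call_model` stands in for whichever branch the request takes (hosted provider or self-hosted server); the cache does not care which one produced the response.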

Two deployment models

              Hosted API (recommended default)   Self-hosted model
Setup         Call a provider's API              Run vLLM/TGI/Ollama on your hardware
Cost          Per token                          Per GPU-hour (+ ops)
Best for      Most apps, fast iteration          Data control, high volume, custom models
Complexity    Low                                High (GPUs, scaling, quantization)

Start with hosted APIs. Self-host only when data residency, cost at scale, or a custom model justifies the operational burden.
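To make "cost at scale" concrete, the per-token and per-GPU-hour rows of the comparison can be turned into a rough break-even estimate. The sketch below is only arithmetic under stated assumptions: all prices are placeholders you supply, the function names are hypothetical, and it ignores whether the GPUs can actually serve the required throughput.

```python
def hosted_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly cost of a hosted API billed per token."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_hosted_cost(
    gpu_count: int,
    price_per_gpu_hour: float,
    monthly_ops_cost: float,
    hours_per_month: float = 730.0,  # assumption: GPUs run all month
) -> float:
    """Monthly cost of always-on GPUs plus a flat operations estimate."""
    return gpu_count * price_per_gpu_hour * hours_per_month + monthly_ops_cost


def break_even_tokens(
    price_per_million_tokens: float,
    gpu_count: int,
    price_per_gpu_hour: float,
    monthly_ops_cost: float,
) -> float:
    """Monthly token volume at which hosted and self-hosted costs are equal."""
    fixed = self_hosted_cost(gpu_count, price_per_gpu_hour, monthly_ops_cost)
    return fixed / price_per_million_tokens * 1_000_000
```

Below the break-even volume the hosted API is cheaper on these numbers; above it, self-hosting starts to pay for itself, provided the ops estimate is honest.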

Learning Objectives

By the end of this section you will be able to:

  • Package an LLM app with Docker.
  • Serve streaming responses correctly (SSE) behind a load balancer.
  • Decide between hosted and self-hosted serving.
  • Apply caching and autoscaling to control cost and latency.

Quick taste: a production-ready container

Dockerfile
FROM python:3.12-slim
WORKDIR /app
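# Copy the package source along with pyproject.toml so `pip install .` can
# build the package (without the source it may fail or install nothing).
COPY pyproject.toml .
COPY app/ app/
RUN pip install --no-cache-dir .
COPY . .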
# Never bake secrets into the image; inject them at runtime.
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
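On the application side, runtime injection usually means reading the secret from the environment at startup and failing fast if it is missing. A minimal sketch, assuming a hypothetical variable name `LLM_API_KEY` and hypothetical helper names:

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected at runtime."""


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Include only the variable name in the error, never a secret value.
        raise MissingSecretError(f"Required environment variable {name} is not set")
    return value
```

Calling `require_env("LLM_API_KEY")` once at startup surfaces a misconfigured deployment immediately instead of on the first user request. The value itself would be supplied when the container starts (for example with `docker run -e LLM_API_KEY ...` or a secret manager), so it never appears in an image layer.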

Best Practices

  • ✅ Inject secrets at runtime (env vars/secret manager), never in the image.
  • ✅ Stream responses (Server-Sent Events) so users see output immediately.
  • ✅ Set sensible timeouts: LLM calls can take many seconds.
  • ✅ Cache where you can (identical prompts, embeddings) to cut cost and latency.
  • ✅ Add health checks, graceful shutdown, and retries with backoff.
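For the streaming item above, the wire format is the part that is easiest to get wrong: each Server-Sent Event is a set of `field: value` lines ended by a blank line, and multi-line data needs one `data:` line per line of text. The sketch below uses only the standard library; `sse_event` and `stream_tokens` are hypothetical helper names, and the final `done` event with `[DONE]` as its payload is an assumed convention, not part of the SSE format.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Event as text.

    Assumes `data` uses "\n" line endings.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # One "data:" line per line of payload; clients rejoin them with "\n".
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per generated token, then a final 'done' event."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event("[DONE]", event="done")
```

In a FastAPI service, a generator like `stream_tokens` would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`. Any proxy or load balancer in front of the service also needs response buffering turned off, or users will still see the output arrive all at once.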

Common Mistakes

  • โŒ Self-hosting a model before you need to โ€” huge ops cost for little benefit.
  • โŒ Short default timeouts that kill legitimate long generations.
  • โŒ No cost controls โ€” a bug or abuse can run up a large bill fast.
  • โŒ Baking API keys into Docker images or committing them.

๐Ÿ Help build this section

Claim a topic by opening an issue:

  • [WANTED] FastAPI + streaming LLM service: full runnable template 🟡
  • [WANTED] Serving models with vLLM: throughput, batching 🔴
  • [WANTED] Kubernetes for AI services: autoscaling, GPUs 🔴
  • [WANTED] Quantization for cheaper inference: GGUF, AWQ, GPTQ 🔴
  • [WANTED] Caching strategies: prompt/response/embedding caches 🟡
