Deployment
Taking an AI feature from your laptop to a reliable, scalable service: packaging, serving, and scaling.
Overview
Deploying an LLM app is mostly normal backend engineering plus a few AI-specific concerns: streaming responses, long request times, token-based cost, and (if you self-host models) GPU serving.
flowchart LR
C[Client] --> LB[Load balancer]
LB --> API[FastAPI service]
API --> P{Model?}
P -->|Hosted API| Prov[LLM Provider]
P -->|Self-hosted| GPU[vLLM / TGI on GPU]
API --> Cache[(Response / prompt cache)]
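The cache node in the diagram can be sketched as a cache-aside lookup keyed on a hash of the model name and the exact prompt. This is a minimal in-memory illustration under stated assumptions, not a production cache: `ResponseCache`, `cache_key`, and the `fake_model` stand-in are hypothetical names, and a real deployment would more likely use a shared store with an expiry policy.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str) -> str:
    """Build a stable key from the model name and the exact prompt text."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical in-memory cache; one instance per process."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, model: str, prompt: str, call_model: Callable[[str, str], str]
    ) -> str:
        key = cache_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_model(model, prompt)  # a model call happens only on a miss
        self._store[key] = result
        return result


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider or self-hosted call (assumption)."""
    return f"[{model}] echo: {prompt}"


cache = ResponseCache()
first = cache.get_or_compute("demo-model", "Hello", fake_model)
second = cache.get_or_compute("demo-model", "Hello", fake_model)  # served from cache
```

Exact-match caching only helps when identical prompts recur; any change to the prompt or model name produces a different key.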
Two deployment models
| | Hosted API (recommended default) | Self-hosted model |
|---|---|---|
| Setup | Call a provider's API | Run vLLM/TGI/Ollama on your hardware |
| Cost | Per token | Per GPU-hour (+ ops) |
| Best for | Most apps, fast iteration | Data control, high volume, custom models |
| Complexity | Low | High (GPUs, scaling, quantization) |
Start with hosted APIs. Self-host only when data residency, cost at scale, or a custom model justifies the operational burden.
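To make the "cost at scale" criterion concrete, the per-token and per-GPU-hour rows of the table can be compared with a rough break-even estimate. The sketch below is illustrative only: every price is a made-up placeholder, the function names are hypothetical, and it assumes the chosen GPU count can actually serve the traffic, which has to be measured separately.

```python
def hosted_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Hosted API cost scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_hosted_monthly_cost(
    gpu_count: int,
    price_per_gpu_hour: float,
    ops_overhead_per_month: float,
    hours_per_month: float = 730.0,
) -> float:
    """Self-hosted cost is roughly fixed: GPUs run all month, plus ops effort."""
    return gpu_count * price_per_gpu_hour * hours_per_month + ops_overhead_per_month


def break_even_tokens(
    price_per_million_tokens: float,
    gpu_count: int,
    price_per_gpu_hour: float,
    ops_overhead_per_month: float,
    hours_per_month: float = 730.0,
) -> float:
    """Monthly token volume at which both options cost the same."""
    fixed = self_hosted_monthly_cost(
        gpu_count, price_per_gpu_hour, ops_overhead_per_month, hours_per_month
    )
    return fixed / price_per_million_tokens * 1_000_000


# Placeholder numbers, not real quotes: $2 per million tokens hosted,
# one GPU at $2/hour, and $1,000/month of operational overhead.
tokens = break_even_tokens(
    price_per_million_tokens=2.0,
    gpu_count=1,
    price_per_gpu_hour=2.0,
    ops_overhead_per_month=1000.0,
)
print(f"Break-even at about {tokens:,.0f} tokens per month")
```

Below the break-even volume the hosted API is cheaper under these assumptions; above it, self-hosting starts to pay off only if the operational burden is also acceptable.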
Learning Objectives
By the end of this section you will be able to:
- Package an LLM app with Docker.
- Serve streaming responses correctly (SSE) behind a load balancer.
- Decide between hosted and self-hosted serving.
- Apply caching and autoscaling to control cost and latency.
Quick taste: a production-ready container
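Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Copy the whole project first: `pip install .` builds from the source tree, not just pyproject.toml.
COPY . .
RUN pip install --no-cache-dir .
# Never bake secrets into the image; inject them at runtime and keep .env files out of the build context (.dockerignore).
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]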
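On the application side, runtime injection usually means reading the secret from the environment at startup and failing fast when it is absent. The following is a minimal sketch; the variable name `LLM_API_KEY`, the `load_api_key` helper, and `MissingSecretError` are assumptions made for illustration, not part of any provider SDK.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret was not injected."""


def load_api_key(
    env: Optional[Mapping[str, str]] = None, name: str = "LLM_API_KEY"
) -> str:
    """Read the key from the environment; never fall back to a hard-coded value."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise MissingSecretError(
            f"{name} is not set; pass it at runtime, for example with `docker run -e {name}`"
        )
    return value


# Demo with an explicit mapping so the example runs without touching real secrets.
demo_key = load_api_key({"LLM_API_KEY": "  example-key  "})
```

At container start the value can be supplied with `docker run -e` or `--env-file`, or by the orchestrator's secret mechanism, so the image itself stays free of credentials.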
Best Practices
- ✅ Inject secrets at runtime (env vars/secret manager), never in the image.
- ✅ Stream responses (Server-Sent Events) so users see output immediately.
- ✅ Set sensible timeouts: LLM calls can take many seconds.
- ✅ Cache where you can (identical prompts, embeddings) to cut cost and latency.
- ✅ Add health checks, graceful shutdown, and retries with backoff.
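For the streaming practice above, it helps to see what a Server-Sent Events body looks like on the wire: each event is one or more `data:` lines followed by a blank line. The sketch below frames model tokens as SSE events using only the standard library; `sse_event`, `stream_tokens`, and the closing `done` event are hypothetical conventions chosen for this example, not something the SSE format requires.

```python
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one SSE event; multi-line data becomes several `data:` lines."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE event per token, then a final named event (assumed convention)."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event("end of stream", event="done")


# Demo with a fixed token list standing in for a model's streamed output.
frames = list(stream_tokens(["Hel", "lo"]))
```

With FastAPI, a generator like `stream_tokens` could be passed to `StreamingResponse(..., media_type="text/event-stream")`. Behind a load balancer or reverse proxy, response buffering typically has to be disabled and idle timeouts raised, otherwise events arrive in one batch at the end.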
Common Mistakes
- ❌ Self-hosting a model before you need to: huge ops cost for little benefit.
- ❌ Short default timeouts that kill legitimate long generations.
- ❌ No cost controls: a bug or abuse can run up a large bill fast.
- ❌ Baking API keys into Docker images or committing them.
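One simple form of cost control is a per-caller token budget that is checked before each model call. The sketch below is a single-process, in-memory illustration; `TokenBudget` and `BudgetExceededError` are hypothetical names, the limit is a placeholder, and a multi-instance service would need a shared counter and a scheduled reset instead.

```python
from typing import Dict


class BudgetExceededError(RuntimeError):
    """Raised when a caller would exceed its token allowance."""


class TokenBudget:
    """Tracks tokens used per caller within one budget window (e.g. a day)."""

    def __init__(self, max_tokens_per_window: int) -> None:
        if max_tokens_per_window <= 0:
            raise ValueError("max_tokens_per_window must be positive")
        self.max_tokens = max_tokens_per_window
        self._used: Dict[str, int] = {}

    def charge(self, caller_id: str, tokens: int) -> int:
        """Record usage and return the remaining allowance, or raise if over budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        used = self._used.get(caller_id, 0)
        if used + tokens > self.max_tokens:
            # Rejected requests are not recorded, so the caller keeps its remaining allowance.
            raise BudgetExceededError(
                f"{caller_id} would use {used + tokens} of {self.max_tokens} tokens"
            )
        self._used[caller_id] = used + tokens
        return self.max_tokens - self._used[caller_id]

    def reset(self) -> None:
        """Start a new budget window for every caller."""
        self._used.clear()


budget = TokenBudget(max_tokens_per_window=1_000)
remaining = budget.charge("user-42", 600)
```

Checking the budget before the call, using an estimate of prompt plus maximum output tokens, keeps a runaway loop or abusive client from spending past the limit.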
Help build this section
Claim a topic by opening an issue:
- [WANTED] FastAPI + streaming LLM service: full runnable template 🟡
- [WANTED] Serving models with vLLM: throughput, batching 🔴
- [WANTED] Kubernetes for AI services: autoscaling, GPUs 🔴
- [WANTED] Quantization for cheaper inference: GGUF, AWQ, GPTQ 🔴
- [WANTED] Caching strategies: prompt/response/embedding caches 🟡