Your First LLM Call¶
Send a prompt to a large language model, get a response, and understand every line of code.
Overview¶
An LLM API call is deceptively simple: you send messages, you get back text. But a few concepts (roles, tokens, temperature, and streaming) show up in every AI application. We'll meet them here in the smallest possible program.
Learning Objectives¶
After this page you will be able to:
- Make a basic chat completion request.
- Explain the `system`, `user`, and `assistant` roles.
- Control output with `max_tokens` and `temperature`.
- Stream a response token-by-token.
Theory: what actually happens¶
When you "call an LLM," you send a list of messages. Each message has a role:
| Role | Who it represents |
|---|---|
| `system` | Instructions that shape the model's behavior (persona, rules, format) |
| `user` | What the human said |
| `assistant` | What the model said (in prior turns, for multi-turn chat) |
The model reads the whole conversation and predicts the next assistant message, one token (roughly a word-piece) at a time. That's it. Everything else (RAG, agents, tools) is built on this loop.
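To make the role structure concrete, here is a minimal sketch of how a multi-turn conversation might be assembled as a plain list of dictionaries. The helper name `append_turn` is hypothetical (it is not part of any SDK); it only illustrates that earlier assistant replies are sent back to the model along with the next user message.

```python
def append_turn(history, role, content):
    """Return a new message list with one more turn appended.

    Illustrative helper, not an SDK function.
    """
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


# With the Anthropic SDK the system prompt is passed separately (the
# `system=` argument), so this list holds only user and assistant turns.
conversation = []
conversation = append_turn(conversation, "user", "What is RAG?")
conversation = append_turn(
    conversation, "assistant", "RAG retrieves documents and gives them to the model."
)
conversation = append_turn(conversation, "user", "Give me one example use case.")

for message in conversation:
    print(f"{message['role']:>9}: {message['content']}")
```

The model itself keeps no memory between calls: whatever should count as "the conversation so far" has to be in this list each time.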
```mermaid
sequenceDiagram
    participant You
    participant SDK as Anthropic SDK
    participant LLM as Claude
    You->>SDK: messages=[{user: "Explain RAG in one sentence"}]
    SDK->>LLM: HTTPS request
    LLM-->>SDK: streamed tokens
    SDK-->>You: "RAG retrieves relevant documents…"
```
Practical Example¶
```python
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()  # loads ANTHROPIC_API_KEY from .env
client = Anthropic()  # reads the key from the environment automatically

response = client.messages.create(
    model="claude-sonnet-5",  # a fast, capable general-purpose model
    max_tokens=200,  # cap the response length (controls cost)
    system="You are a concise technical teacher.",  # shapes behavior
    messages=[
        {"role": "user", "content": "Explain retrieval-augmented generation in one sentence."}
    ],
)

print(response.content[0].text)
print(f"\nTokens in: {response.usage.input_tokens}, out: {response.usage.output_tokens}")
```
Run it, and you'll see a one-sentence answer plus a token count. Those token counts are your bill: most providers charge per input and output token, so tracking them is a production habit worth forming early.
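One way to build that habit is to keep a running total across calls. The sketch below is a hypothetical helper (`UsageTally` is not part of the SDK) that accepts any object with `input_tokens` and `output_tokens` attributes, such as `response.usage` in the example above. The numbers are made up so the snippet runs without an API call.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTally:
    """Running total of tokens across calls (illustrative, not an SDK class)."""

    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0

    def add(self, usage):
        # `usage` only needs `input_tokens` and `output_tokens` attributes.
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        self.calls += 1

    def summary(self):
        return (
            f"{self.calls} calls, {self.input_tokens} tokens in, "
            f"{self.output_tokens} tokens out"
        )


tally = UsageTally()
# Stand-ins for `response.usage`; the token counts are invented.
tally.add(SimpleNamespace(input_tokens=25, output_tokens=40))
tally.add(SimpleNamespace(input_tokens=30, output_tokens=55))
print(tally.summary())
```

In a real program you would call `tally.add(response.usage)` after each request and write the summary to your logs.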
Controlling the output¶
- `max_tokens`: a hard cap on the response length. Too low and you get cut off; too high wastes money on rambling.
- `temperature` (0.0 to 1.0): randomness. `0` is nearly deterministic (good for extraction, classification); higher values are more creative (good for brainstorming).
```python
response = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=200,
    temperature=0.0,  # near-deterministic: same input tends to give the same output
    messages=[{"role": "user", "content": "List three uses of embeddings."}],
)
```
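The "cut off" case is worth detecting in code rather than by eye. The sketch below assumes the response exposes a `stop_reason` field that equals `"max_tokens"` when the cap was hit, as the Anthropic Messages API does; `was_truncated` is a hypothetical helper name, and the stand-in objects exist only so the snippet runs without an API call.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if generation stopped because it reached the max_tokens cap."""
    return response.stop_reason == "max_tokens"


# Stand-ins for real responses (assumption: only `stop_reason` matters here).
finished = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

for name, response in [("finished", finished), ("cut_off", cut_off)]:
    if was_truncated(response):
        print(f"{name}: hit the cap, consider raising max_tokens")
    else:
        print(f"{name}: completed normally")
```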
Streaming (so users see text as it's generated)¶
```python
with client.messages.stream(
    model="claude-sonnet-5",
    max_tokens=300,
    messages=[{"role": "user", "content": "Write a haiku about vector databases."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # prints text chunks as they arrive
print()  # final newline once the stream is finished
```
Streaming doesn't make the model faster. It makes the experience feel faster, because the user reads the first words while the rest is still being generated.
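Often you want both: show text as it arrives and keep the complete response for logging or storage afterwards. A minimal sketch, assuming only that the stream is an iterable of text chunks (as `stream.text_stream` is in the example above); `print_and_collect` is a hypothetical helper, and the fake chunks stand in for a real stream.

```python
def print_and_collect(chunks):
    """Print each text chunk as it arrives and return the full text."""
    parts = []
    for text in chunks:
        print(text, end="", flush=True)
        parts.append(text)
    print()  # final newline after the last chunk
    return "".join(parts)


# Fake chunks so the sketch runs without an API call.
fake_stream = iter(["Vectors ", "sleep ", "in rows"])
full_text = print_and_collect(fake_stream)
print(f"Collected {len(full_text)} characters")
```

With a real call, you would pass `stream.text_stream` inside the `with` block instead of `fake_stream`.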
Using a different provider?
The SDK differs but the shape is identical: messages in, text out. OpenAI's `client.chat.completions.create(...)`, Google's `genai`, and local Ollama all follow the same mental model.
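One small difference you will notice when switching: the Anthropic call above takes the system prompt as a separate `system=` argument, while OpenAI-style chat APIs expect it as the first message with the role `system`. The sketch below shows that reshaping with a hypothetical helper, `to_openai_style`, which is not part of either SDK.

```python
def to_openai_style(system, messages):
    """Prepend the system prompt as a `system` message (illustrative helper)."""
    converted = []
    if system:
        converted.append({"role": "system", "content": system})
    converted.extend(messages)
    return converted


anthropic_style = [
    {"role": "user", "content": "Explain retrieval-augmented generation in one sentence."}
]
openai_style = to_openai_style("You are a concise technical teacher.", anthropic_style)

for message in openai_style:
    print(message["role"], "->", message["content"])
```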
Best Practices¶
- Always set `max_tokens`: it's your primary cost guardrail.
- Use `temperature=0` for anything that should be consistent (parsing, classification).
- Stream for user-facing chat; buffer for background jobs.
- Log token usage from day one.
Common Mistakes¶
- Hardcoding the API key instead of loading it from the environment.
- Forgetting `max_tokens`, then being surprised by a long, expensive response.
- Assuming `temperature=0` is fully deterministic. It's close, not guaranteed.
- Blocking the UI while waiting for a full response instead of streaming.
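For the first mistake, a small guard at startup gives a clear error when the key is missing rather than a confusing failure later. This is a minimal sketch: `require_api_key` is a hypothetical helper, and `EXAMPLE_MISSING_KEY` is a made-up variable name used only to show the failure path.

```python
import os


def require_api_key(name="ANTHROPIC_API_KEY"):
    """Return the key from the environment, or fail with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return key


# Demonstrate the failure path with a variable that should not exist.
try:
    require_api_key("EXAMPLE_MISSING_KEY")
except RuntimeError as err:
    print(err)
```

The key never appears in the source file, so it cannot end up in version control by accident.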
Exercises¶
- Change the `system` prompt to make the model answer as a pirate. Notice how much behavior the system prompt controls.
- Set `temperature=1.0` and run the haiku prompt five times. How different are the outputs?
- Print the cost of a call: multiply input/output tokens by your provider's per-token price.
References¶
- Anthropic Messages API
- What are tokens? (Bee's tokenization guide)
- Prompt Engineering (make your prompts better)
Next: pick a Learning Path, or dive into How LLMs Work.