
Your First LLM Call

Send a prompt to a large language model, get a response, and understand every line of code.

Overview

An LLM API call is deceptively simple: you send messages, you get back text. But a few concepts (roles, tokens, temperature, and streaming) show up in every AI application. We'll meet them here in the smallest possible program.

Learning Objectives

After this page you will be able to:

  • Make a basic chat completion request.
  • Explain the system, user, and assistant roles.
  • Control output with max_tokens and temperature.
  • Stream a response token-by-token.

Theory: what actually happens

When you "call an LLM," you send a list of messages. Each message has a role:

Role        Who it represents
system      Instructions that shape the model's behavior (persona, rules, format)
user        What the human said
assistant   What the model said (in prior turns, for multi-turn chat)

The model reads the whole conversation and predicts the next assistant message, one token (≈ a word piece) at a time. That's it. Everything else (RAG, agents, tools) is built on this loop.

sequenceDiagram
    participant You
    participant SDK as Anthropic SDK
    participant LLM as Claude
    You->>SDK: messages=[{user: "Explain RAG in one sentence"}]
    SDK->>LLM: HTTPS request
    LLM-->>SDK: streamed tokens →
    SDK-->>You: "RAG retrieves relevant documents…"
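
To make the role mechanics concrete, here is a minimal sketch of how a multi-turn conversation could be assembled as a plain list of dictionaries. The helper name add_turn is hypothetical (it is not part of any SDK); only the role and content keys come from the message shape described above. The list holds just user and assistant turns because the Anthropic example on this page passes the system prompt as a separate system argument.

```python
# Hypothetical helper for illustration; not part of the Anthropic SDK.
VALID_ROLES = {"user", "assistant"}


def add_turn(history, role, content):
    """Return a new message list with one more turn appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return history + [{"role": role, "content": content}]


history = []
history = add_turn(history, "user", "Explain RAG in one sentence.")
history = add_turn(history, "assistant", "RAG retrieves relevant documents and feeds them to the model.")
history = add_turn(history, "user", "Now give an example.")

# The whole list is what gets sent on the next call, not just the last turn.
for message in history:
    print(f"{message['role']:>9}: {message['content']}")
```

Because the model itself keeps no memory between calls, resending the full history each time is what makes a chat feel continuous.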

Practical Example

first_call.py
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()  # loads ANTHROPIC_API_KEY from .env
client = Anthropic()  # reads the key from the environment automatically

response = client.messages.create(
    model="claude-sonnet-5",          # a fast, capable general-purpose model
    max_tokens=200,                   # cap the response length (controls cost)
    system="You are a concise technical teacher.",  # shapes behavior
    messages=[
        {"role": "user", "content": "Explain retrieval-augmented generation in one sentence."}
    ],
)

print(response.content[0].text)
print(f"\nTokens in: {response.usage.input_tokens}, out: {response.usage.output_tokens}")

Run it:

python first_call.py

You'll see a one-sentence answer plus a token count. Those token counts are your bill: most providers charge per input and output token, so tracking them is a production habit worth forming early.
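
As a rough illustration of how token counts turn into money, the sketch below multiplies each count by a per-million-token price. The two prices are placeholders chosen for the example, not real rates, and estimate_cost is a hypothetical helper; check your provider's pricing page for actual numbers.

```python
# Placeholder prices in USD per million tokens. These are assumptions
# for illustration only, not any provider's real rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost(input_tokens, output_tokens,
                  input_price=INPUT_PRICE_PER_MTOK,
                  output_price=OUTPUT_PRICE_PER_MTOK):
    """Estimate the USD cost of one call from its token counts."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


cost = estimate_cost(input_tokens=25, output_tokens=60)
print(f"Estimated cost: ${cost:.6f}")
```

With a real response you would pass response.usage.input_tokens and response.usage.output_tokens instead of the literal numbers.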

Controlling the output

  • max_tokens: a hard cap on the response length. Too low and you get cut off; too high wastes money on rambling.
  • temperature (0.0–1.0): randomness. 0 is nearly deterministic (good for extraction, classification); higher values are more creative (good for brainstorming).

response = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=200,
    temperature=0.0,   # near-deterministic: same input tends to give the same output
    messages=[{"role": "user", "content": "List three uses of embeddings."}],
)
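
One common way to picture temperature is as a divisor applied to the model's raw scores (logits) before they are turned into probabilities. The sketch below is a conceptual illustration in plain Python with made-up logits; it is not a description of how any particular provider implements sampling, and softmax_with_temperature is a name chosen for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures pile probability onto the top-scoring token; higher ones spread it out, which is why outputs become more varied.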

Streaming (so users see text as it's generated)

streaming.py
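from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()  # loads ANTHROPIC_API_KEY from .env
client = Anthropic()

with client.messages.stream(
    model="claude-sonnet-5",
    max_tokens=300,
    messages=[{"role": "user", "content": "Write a haiku about vector databases."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)   # prints text chunks as they arrive
print()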

Streaming doesn't make the model faster; it makes the experience feel faster, because the user reads the first words while the rest is still being generated.
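
Often you want both behaviors at once: show text as it arrives and keep the full response for logging or storage. The sketch below uses a stand-in generator rather than a real API stream so it runs offline; fake_text_stream and consume are hypothetical names, and with the real SDK you would pass stream.text_stream in place of the fake generator.

```python
def fake_text_stream():
    """Stand-in for stream.text_stream: yields text chunks one at a time."""
    yield from ["Vectors ", "drift ", "in ", "space"]


def consume(chunks, on_chunk):
    """Forward each chunk to a callback and return the full text."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)       # e.g. print it or push it to the UI
        parts.append(chunk)   # keep a copy for later
    return "".join(parts)


full_text = consume(fake_text_stream(), lambda c: print(c, end="", flush=True))
print()
print(f"Collected {len(full_text)} characters")
```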

Using a different provider?

The SDK differs but the shape is identical: messages in, text out. OpenAI's client.chat.completions.create(...), Google's genai, and local Ollama all follow the same mental model.
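
One small difference worth knowing: the Anthropic call on this page takes the system prompt as a separate system argument, while OpenAI-style chat APIs typically carry it as the first message with the role system. The sketch below shows that reshaping as a pure function; to_openai_style is a hypothetical helper name, and the exact request format should be checked against your provider's documentation.

```python
def to_openai_style(system, messages):
    """Prepend the system prompt as a message, as OpenAI-style chat APIs expect."""
    converted = []
    if system:
        converted.append({"role": "system", "content": system})
    converted.extend(messages)
    return converted


anthropic_style = [
    {"role": "user", "content": "Explain retrieval-augmented generation in one sentence."}
]
openai_style = to_openai_style("You are a concise technical teacher.", anthropic_style)

for message in openai_style:
    print(message["role"], "->", message["content"])
```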

Best Practices

  • ✅ Always set max_tokens: it's your primary cost guardrail.
  • ✅ Use temperature=0 for anything that should be consistent (parsing, classification).
  • ✅ Stream for user-facing chat; buffer for background jobs.
  • ✅ Log token usage from day one.
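
The max_tokens guardrail has a flip side: a response that hits the cap is cut off mid-thought. A minimal sketch of detecting that follows, assuming the response object exposes a stop_reason field whose value is "max_tokens" when the cap was reached (as the Anthropic Messages API documents). The stub objects and the was_truncated helper are hypothetical, used so the example runs without a network call.

```python
from types import SimpleNamespace


def was_truncated(response):
    """True if the response stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stub objects standing in for real API responses.
cut_off = SimpleNamespace(stop_reason="max_tokens")
complete = SimpleNamespace(stop_reason="end_turn")

for name, response in (("cut_off", cut_off), ("complete", complete)):
    status = "truncated, consider raising max_tokens" if was_truncated(response) else "finished normally"
    print(f"{name}: {status}")
```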

Common Mistakes

  • ❌ Hardcoding the API key instead of loading it from the environment.
  • ❌ Forgetting max_tokens, then being surprised by a long, expensive response.
  • ❌ Assuming temperature=0 is fully deterministic. It's close, but not guaranteed.
  • ❌ Blocking the UI while waiting for a full response instead of streaming.
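
For the first mistake in this list, one possible safeguard is to fail early with a clear message when the key is missing from the environment, instead of pasting it into the source. This is a sketch: require_api_key is a hypothetical helper, and ANTHROPIC_API_KEY is the variable name used in the example on this page.

```python
import os


def require_api_key(name="ANTHROPIC_API_KEY"):
    """Return the key from the environment, or raise a readable error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return key


try:
    require_api_key()
    print("API key found")
except RuntimeError as err:
    print(f"Configuration problem: {err}")
```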

Exercises

  1. Change the system prompt to make the model answer as a pirate. Notice how much behavior the system prompt controls.
  2. Set temperature=1.0 and run the haiku prompt five times. How different are the outputs?
  3. Print the cost of a call: multiply input/output tokens by your provider's per-token price.

Next: pick a Learning Path → or dive into How LLMs Work →.