Computer Vision & Multimodal
How modern AI sees: from image understanding with multimodal LLMs to OCR, detection, and image generation.
Overview
"Vision AI" today spans two worlds that are rapidly merging:
- Multimodal LLMs: models that accept images and text in the same prompt, so you can ask questions about a screenshot, a chart, or a photo in natural language.
- Specialized vision models: OCR, object detection, segmentation, and image generation (diffusion) models that each do one thing extremely well.
Most applications combine them: a multimodal LLM for reasoning, a specialized model for the heavy lifting.
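As a rough illustration of that division of labor, the sketch below takes word boxes from an OCR step, rebuilds lines of text from them, and wraps the result in a prompt for an LLM. Everything named here (`Word`, `group_into_lines`, `build_prompt`, the `y_tolerance` value, and the sample receipt) is hypothetical and chosen for illustration; a real pipeline would get its words from an OCR engine and send the prompt to a model.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with the top-left corner of its bounding box (pixels)."""
    text: str
    x: int
    y: int


def group_into_lines(words: list[Word], y_tolerance: int = 8) -> list[str]:
    """Cluster words whose y positions are close, then read each cluster left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Compare against the first word of the current line to decide whether to extend it.
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


def build_prompt(lines: list[str], question: str) -> str:
    """Combine the recovered text with an explicit instruction for the LLM."""
    document = "\n".join(lines)
    return f"Document text (from OCR):\n{document}\n\nTask: {question}"


# Hypothetical OCR output for a small receipt, deliberately out of reading order.
words = [
    Word("Total", 10, 82), Word("7.50", 120, 80),
    Word("Coffee", 10, 20), Word("3.50", 120, 22),
    Word("4.00", 120, 51), Word("Bagel", 10, 50),
]
lines = group_into_lines(words)
prompt = build_prompt(lines, "Return the total as a number.")
print(prompt)
```

The specialized step recovers text and layout; the LLM then only has to reason over clean text with a precise task.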
What this section covers
mindmap
  root((Vision))
    Multimodal LLMs
      Image + text prompts
      Visual question answering
      Document understanding
    OCR
      Printed & handwritten text
      Layout & tables
    Recognition
      Classification
      Object detection
      Segmentation
    Generation
      Diffusion models
      Image editing / inpainting
Learning Objectives
By the end of this section you will be able to:
- Send an image to a multimodal LLM and reason about the response.
- Choose between a multimodal LLM and a specialized model for a task.
- Build a document-understanding pipeline (OCR → structure → LLM).
- Understand how diffusion models generate images at a high level.
Quick taste: ask an LLM about an image
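describe_image.py
import base64

from anthropic import Anthropic

# The client reads the ANTHROPIC_API_KEY environment variable by default.
client = Anthropic()

# The API expects the image bytes as a base64-encoded string.
with open("chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-5",  # use a vision-capable model ID available to your account
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            # Image block first, then the question about it.
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "What trend does this chart show? Be specific."},
        ],
    }],
)

# The reply is a list of content blocks; the first one holds the text answer.
print(response.content[0].text)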
Best Practices
- Downscale images to the model's recommended size; oversized images cost more tokens without improving results.
- For documents, combine OCR and layout detection with an LLM rather than relying on one model.
- Be explicit in prompts about what to extract ("return the total as a number").
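To make the downscaling advice concrete, here is a minimal sketch that computes a target size preserving the aspect ratio and never upscaling. The helper names `fit_within` and `downscale_file` are hypothetical, and the 1568-pixel default is an assumption used for illustration; check your model's documentation for its recommended size. The file helper assumes Pillow (9.1 or later) is installed.

```python
def fit_within(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale_file(src: str, dst: str, max_edge: int = 1568) -> None:
    """Resize an image on disk. Requires Pillow (pip install pillow)."""
    from PIL import Image  # imported lazily so fit_within works without Pillow

    with Image.open(src) as img:
        new_size = fit_within(img.width, img.height, max_edge)
        img.resize(new_size, Image.Resampling.LANCZOS).save(dst)


print(fit_within(4000, 3000))  # a 12-megapixel photo
print(fit_within(800, 600))    # small images pass through unchanged
```

Running `downscale_file("chart.png", "chart_small.png")` before encoding the image would then send the smaller file instead of the original.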
Common Mistakes
- Sending full-resolution images and paying for tokens you don't need.
- Trusting OCR blindly on low-quality scans. Validate critical fields.
- Using image generation output commercially without checking the model's license.
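One way to validate a critical field is to cross-check it against other values on the same document. The sketch below, offered as an illustration rather than a complete validator, parses OCR'd amounts strictly and checks that line items add up to the stated total. The function names and the accepted money format (optional `$`, thousands commas, up to two decimals) are assumptions.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse a string such as '$1,234.50'; return None if it is not a clean number."""
    cleaned = re.sub(r"[,$\s]", "", raw)
    # Reject anything with stray characters, e.g. 'S' misread for '5'.
    if not re.fullmatch(r"\d+(\.\d{1,2})?", cleaned):
        return None
    return Decimal(cleaned)


def total_is_consistent(line_items: list[str], total: str) -> bool:
    """True only if every amount parses and the items sum exactly to the total."""
    amounts = [parse_amount(item) for item in line_items]
    parsed_total = parse_amount(total)
    if parsed_total is None or any(a is None for a in amounts):
        return False
    return sum(amounts, Decimal("0")) == parsed_total


print(total_is_consistent(["3.50", "4.00"], "7.50"))  # clean scan
print(total_is_consistent(["3.50", "4.00"], "7.S0"))  # OCR misread in the total
```

A failed check is a signal to route the document to a human or to re-run extraction, not to guess the value.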
Help build this section
This section is early. We'd love contributions on the topics below; claim one by opening an issue:
- Multimodal LLMs: how image tokens work, cost, structured extraction
- OCR: Tesseract, preprocessing, OCR→LLM pipelines
- [WANTED] Object detection with YOLO: a runnable example
- [WANTED] Diffusion models explained: how image generation works
- [WANTED] Document QA example: invoices/receipts to structured data
See CONTRIBUTING.md and the content contract.
References
- Anthropic: Vision
- Hugging Face: Computer Vision course
- CLIP paper: connecting text and images