Computer Vision & Multimodal

How modern AI sees: from image understanding with multimodal LLMs to OCR, detection, and image generation.

Overview

"Vision AI" today spans two worlds that are rapidly merging:

  1. Multimodal LLMs: models that accept images and text in the same prompt, so you can ask questions about a screenshot, a chart, or a photo in natural language.
  2. Specialized vision models: OCR, object detection, segmentation, and image generation (diffusion) models that each do one thing extremely well.

Most applications combine them: a multimodal LLM for reasoning, a specialized model for the heavy lifting.

What this section covers

mindmap
  root((Vision))
    Multimodal LLMs
      Image + text prompts
      Visual question answering
      Document understanding
    OCR
      Printed & handwritten text
      Layout & tables
    Recognition
      Classification
      Object detection
      Segmentation
    Generation
      Diffusion models
      Image editing / inpainting

Learning Objectives

By the end of this section you will be able to:

  • Send an image to a multimodal LLM and reason about the response.
  • Choose between a multimodal LLM and a specialized model for a task.
  • Build a document-understanding pipeline (OCR → structure → LLM).
  • Explain, at a high level, how diffusion models generate images.
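
The OCR → structure → LLM pipeline named in the objectives can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Word` type, `group_into_lines`, `build_prompt`, and the `y_tolerance` value are hypothetical names chosen for this example, and the OCR output is hard-coded where a real engine would supply it.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with the top-left corner of its bounding box (pixels)."""
    text: str
    x: int
    y: int


def group_into_lines(words, y_tolerance=8):
    """Structure step: cluster words into text lines by vertical position.

    A word joins the current line when its y is within y_tolerance
    (an assumed value) of that line's first word.
    """
    lines = []  # each entry is a list of Word objects on one line
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Read each line left to right.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]


def build_prompt(lines, question):
    """LLM step: give the model the structured text plus an explicit question."""
    document = "\n".join(lines)
    return f"Document text:\n{document}\n\nQuestion: {question}"


# Hard-coded stand-in for real OCR output, deliberately out of order.
ocr_words = [
    Word("Total", 10, 52),
    Word("#1042", 90, 10),
    Word("Invoice", 10, 12),
    Word("$18.50", 90, 50),
]
lines = group_into_lines(ocr_words)
prompt = build_prompt(lines, "Return the total as a number.")
print(prompt)
```

The resulting `prompt` string would then be sent as the text part of a message to whichever LLM you use. Real layouts (multi-column pages, tables) need a more careful structure step than this single-column grouping.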

Quick taste: ask an LLM about an image

describe_image.py
import base64
from anthropic import Anthropic

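# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

# The API expects the image bytes as a base64-encoded string.
with open("chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")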

response = client.messages.create(
    model="claude-sonnet-5",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "What trend does this chart show? Be specific."},
        ],
    }],
)
print(response.content[0].text)

Best Practices

  • ✅ Downscale images to the model's recommended size: huge images cost more tokens for no gain.
  • ✅ For documents, combine OCR + layout detection with an LLM rather than relying on one model.
  • ✅ Be explicit in prompts about what to extract ("return the total as a number").

Common Mistakes

  • โŒ Sending full-resolution images and paying for tokens you don't need.
  • โŒ Trusting OCR blindly on low-quality scans โ€” validate critical fields.
  • โŒ Using image generation output commercially without checking the model's license.

๐Ÿ Help build this section

This section is early. We'd love contributions on the topics below; claim one by opening an issue:

  • ✅ Multimodal LLMs: how image tokens work, cost, structured extraction 🟡
  • ✅ OCR: Tesseract, preprocessing, OCR→LLM pipelines 🟡
  • [WANTED] Object detection with YOLO: a runnable example 🟡
  • [WANTED] Diffusion models explained: how image generation works 🔴
  • [WANTED] Document QA example: invoices/receipts to structured data 🟡

See CONTRIBUTING.md and the content contract.