Most developers interact with Claude through text. That's leaving a significant capability on the table.
Claude's vision API lets you send images in any messages call — no separate endpoint, no special SDK, no additional setup beyond what you're already using. The same client you use for text completions handles images. You pass an image alongside your text prompt, and Claude reads them together.
This guide covers the Claude Vision API end to end: how to send images (both URL and base64), prompt patterns for the tasks that actually come up in production (OCR, classification, chart reading, defect detection), batch processing with structured output, cost calculations, and a comparison to GPT-4o Vision and Gemini.
What Claude Vision can handle
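Before the code: supported formats are JPEG, PNG, GIF, and WebP. Anthropic's documentation lists a maximum of 5MB per image and up to 100 images per request for the API (the 20-image cap applies to claude.ai); check the current docs, since limits change between model generations. Images whose long edge exceeds roughly 1568 px are downscaled before processing, so resizing beforehand is not required for correctness, but it does cut upload time and latency.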
Claude Vision works well for:
- Extracting text from scanned documents, screenshots, and photos (OCR)
- Classifying document types (invoice vs receipt vs contract)
- Reading data from charts, graphs, and tables
- Describing products or scenes
- Comparing before/after images
- Analyzing UI screenshots for UX issues
- Detecting damage, defects, or anomalies in product photos
What it can't do: generate images. Claude is an analysis model. For generation, you need Imagen, DALL-E, or Stable Diffusion.
Sending images: URL vs base64
There are two ways to pass an image to the Claude Vision API.
Method 1: URL — for publicly accessible images. Faster to write, no encoding overhead:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/product.jpg"
                }
            },
            {
                "type": "text",
                "text": "What product is this? Extract name, color, and any visible price."
            }
        ]
    }]
)
print(response.content[0].text)
Method 2: Base64 — for local files, private images, or anything that isn't publicly accessible:
import anthropic, base64
from pathlib import Path

client = anthropic.Anthropic()

# Read the local file and encode it as a base64 string
image_data = base64.standard_b64encode(Path("invoice.png").read_bytes()).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {
                "type": "text",
                "text": "Extract all text from this invoice."
            }
        ]
    }]
)
Use image/jpeg for JPG files, image/png for PNG, image/gif for GIF, and image/webp for WebP. The media_type must match the actual file format, not just the file extension: a mismatch can cause the request to be rejected or the image to be misread.
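One way to keep media_type honest is to derive it from the file's leading bytes rather than its extension. The sketch below is an illustration, not part of the Anthropic SDK: `sniff_media_type` and `load_image_source` are hypothetical helper names, and the signature table covers only the four formats listed above.

```python
import base64
from pathlib import Path

# Leading "magic" bytes for the formats the article lists as supported.
_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_media_type(data: bytes) -> str:
    """Return the media type implied by the file's leading bytes."""
    for signature, media_type in _SIGNATURES:
        if data.startswith(signature):
            return media_type
    # WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized or unsupported image format")


def load_image_source(path: str) -> dict:
    """Build a base64 image `source` dict whose media_type comes from the bytes."""
    data = Path(path).read_bytes()
    return {
        "type": "base64",
        "media_type": sniff_media_type(data),
        "data": base64.standard_b64encode(data).decode(),
    }
```

The returned dict can be dropped in as the "source" value of an image content block, so a PNG saved with a .jpg extension still goes out labeled as image/png.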
Prompt patterns by task
The image is only half the input. The prompt matters just as much as it does for text-only requests. Here are the patterns that work for the most common vision tasks.
OCR and text extraction
Extract all text from this image exactly as it appears, preserving the original formatting.
If you see a table, output it as markdown with | separators.
If there are multiple sections (header, body, footer), label them clearly.
If any text is partially obscured or unclear, include your best reading followed by [?].
The explicit formatting instruction prevents Claude from summarizing instead of transcribing — a common failure mode when the prompt is just "extract text."
Document classification
Classify this document as one of: invoice, receipt, purchase order, contract, bank statement,
form, report, or other.
Return valid JSON in this format:
{
  "type": "invoice",
  "confidence": "high",
  "key_identifiers": ["invoice number visible", "line items with prices", "due date present"]
}
Use "high", "medium", or "low" for confidence based on how clearly the document matches the type.
Asking for key_identifiers forces Claude to ground its classification in specific visual evidence rather than guessing.
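Because the prompt pins down the allowed values, the reply can be checked mechanically before it reaches downstream code. The following is a minimal sketch under stated assumptions: `parse_classification` is a hypothetical helper, and the fence-stripping step assumes the model may occasionally wrap its JSON in a markdown code fence.

```python
import json

FENCE = "`" * 3  # markdown code fence marker
ALLOWED_TYPES = {
    "invoice", "receipt", "purchase order", "contract",
    "bank statement", "form", "report", "other",
}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_classification(raw: str) -> dict:
    """Parse and validate the JSON reply requested by the classification prompt."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag) and the closing fence.
        text = text.split("\n", 1)[-1]
        text = text.rsplit(FENCE, 1)[0]
    result = json.loads(text)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    if result.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {result.get('type')!r}")
    if result.get("confidence") not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {result.get('confidence')!r}")
    if not isinstance(result.get("key_identifiers"), list):
        raise ValueError("key_identifiers must be a list")
    return result
```

Called as `parse_classification(response.content[0].text)`, it either returns a dict with the three expected fields or raises ValueError, which makes it easy to route malformed replies to a retry or a review queue.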
Product cataloging
Analyze this product image and extract details in JSON format:
{
  "name": "product name or best description",
  "brand": "visible brand or null",
  "color": "primary color(s)",
  "size_visible": "any size information visible on packaging",
  "price_visible": "price if shown, else null",
  "condition": "new, used, or unclear"
}
If a field isn't visible or determinable, use null — don't guess.
Chart and graph reading
Describe the data in this chart precisely. Extract:
- Chart type (bar, line, pie, scatter, etc.)
- X and Y axis labels and units
- All data series names and their colors/patterns
- Approximate values at: highest point, lowest point, most recent point (if time series)
- The main trend or insight the chart communicates
If values are approximate, say so. Don't round unless the chart rounds.
This prompt works on screenshots of Excel charts, embedded analytics dashboards, and published data visualizations. Claude handles axis reading surprisingly well even on cluttered charts.
Damage and defect detection
Inspect this image for damage, defects, or quality issues.
For each issue found, provide:
- Location: describe in plain English (e.g., "top-right corner", "center of the surface")
- Type: what kind of damage or defect
- Severity: minor (cosmetic only), moderate (functional impact possible), severe (clearly defective)
- Recommended action: accept, flag for review, or reject
If no issues are found, say "No defects detected" with a brief description of what you examined.
UI screenshot analysis
This is a screenshot of a web page or app UI. Analyze it and identify:
1. Main call-to-action: what action is the page primarily asking users to take?
2. Navigation: list the main navigation items visible
3. Error states: any visible error messages, broken elements, or missing images
4. UX issues: anything that looks confusing, inaccessible, or inconsistent with good design
5. Content: what is the page primarily about?
This is useful for automated visual QA — run it against screenshots from your test suite to catch visual regressions that unit tests miss.
Multi-image comparison
Sending multiple images in one request lets Claude reason across them — before/after comparisons, product variants, document versions:
# Reuses client, base64, and Path from the examples above
before_b64 = base64.standard_b64encode(Path("before.jpg").read_bytes()).decode()
after_b64 = base64.standard_b64encode(Path("after.jpg").read_bytes()).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Image 1 (before renovation):"},
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/jpeg", "data": before_b64}
            },
            {"type": "text", "text": "Image 2 (after renovation):"},
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/jpeg", "data": after_b64}
            },
            {
                "type": "text",
                "text": "What specifically changed between these two images? List each change you can identify."
            }
        ]
    }]
)
Label each image explicitly in the text. When Claude processes multiple images, labeling ("Image 1:", "Image 2:") makes references in the response unambiguous. Without labels, "the image on the left" doesn't mean anything in an API response.
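The labeling convention generalizes to any number of images. As a rough sketch, a small helper can interleave the labels and image blocks so the numbering never drifts; `labeled_image_blocks` is a hypothetical name, and it takes raw bytes plus a media type for each image.

```python
import base64


def labeled_image_blocks(
    images: list[tuple[str, bytes, str]], question: str
) -> list[dict]:
    """Build a content list of 'Image N (label):' text blocks, each followed by its image.

    `images` holds (label, raw_bytes, media_type) tuples; `question` goes last.
    """
    content: list[dict] = []
    for index, (label, data, media_type) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index} ({label}):"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(data).decode(),
            },
        })
    content.append({"type": "text", "text": question})
    return content
```

The result can be passed directly as the "content" value of a user message, and the prompt can then refer to "Image 1" and "Image 2" by number.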
Batch processing with structured output
For processing large volumes of images — product catalogs, document archives, screenshot libraries — use Claude Haiku instead of Sonnet. Same API, significantly lower cost:
import anthropic, base64, json
from pathlib import Path
from pydantic import BaseModel

client = anthropic.Anthropic()


class ProductInfo(BaseModel):
    name: str
    category: str
    has_price_tag: bool
    dominant_colors: list[str]
    quality_issues: list[str]


def analyze_product_image(image_path: str) -> ProductInfo:
    path = Path(image_path)
    image_data = base64.standard_b64encode(path.read_bytes()).decode()
    # Derive the media type from the extension (.jpg maps to image/jpeg)
    ext = path.suffix.lower().lstrip(".")
    media_type = f"image/{'jpeg' if ext == 'jpg' else ext}"
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku for cost-efficient batch processing
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data}
                },
                {
                    "type": "text",
                    "text": f"""Analyze this product image. Return valid JSON matching this schema:
{json.dumps(ProductInfo.model_json_schema(), indent=2)}
For quality_issues, list any visible defects, damage, or quality problems.
Empty list if none found."""
                }
            ]
        }]
    )
    # Raises a pydantic ValidationError if the reply is not valid JSON for the schema
    return ProductInfo.model_validate_json(response.content[0].text)


# Process a directory of product images
def batch_analyze(image_dir: str) -> list[dict]:
    results = []
    for image_path in Path(image_dir).glob("*.jpg"):
        try:
            info = analyze_product_image(str(image_path))
            results.append({"file": image_path.name, **info.model_dump()})
        except Exception as e:
            # Record the failure and keep going so one bad image doesn't stop the batch
            results.append({"file": image_path.name, "error": str(e)})
    return results
For very large batches (thousands of images), look at the Anthropic Batch API — it processes requests asynchronously at 50% lower cost.
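A batch submission might look roughly like the sketch below. It relies on `client.messages.batches.create` from the Anthropic Python SDK, where each request carries a `custom_id` and a `params` dict shaped like a normal messages call. The helper names, the `img-{index}` ID scheme, and the assumption that every file is a JPEG are illustrative choices, not something the API requires.

```python
import base64
from pathlib import Path


def build_batch_requests(
    image_paths: list[str],
    prompt: str,
    model: str = "claude-haiku-4-5-20251001",
) -> list[dict]:
    """Build one batch request per image (assumes all files are JPEGs)."""
    requests = []
    for index, image_path in enumerate(image_paths):
        data = base64.standard_b64encode(Path(image_path).read_bytes()).decode()
        requests.append({
            # Index-based IDs avoid characters such as dots from file names.
            "custom_id": f"img-{index}",
            "params": {
                "model": model,
                "max_tokens": 512,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": data,
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }],
            },
        })
    return requests


def submit_batch(requests: list[dict]) -> str:
    """Submit the requests and return the batch ID for later polling."""
    import anthropic  # imported here so the builder above works without the SDK

    client = anthropic.Anthropic()
    batch = client.messages.batches.create(requests=requests)
    return batch.id
```

Results arrive asynchronously, so keep a mapping from each custom_id back to its file name in order to match replies to images once the batch finishes.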
Understanding vision costs
Images are billed as tokens. The token count depends on image dimensions:
- A 1024×1024 image ≈ 1,600 input tokens
- A 512×512 image ≈ 400 input tokens
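- A 2048×2048 image ≈ 6,400 input tokens if processed at full size (Anthropic's documentation notes that large images may be downscaled first, which lowers the actual count)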
At Claude Sonnet 4.6 pricing ($3/1M input tokens):
- 1024×1024 image: ~$0.005 per image
- Processing 1,000 images: ~$5
At Claude Haiku pricing ($0.80/1M input tokens, the Claude Haiku 3.5 rate; check current pricing for the Haiku model you actually call):
- 1024×1024 image: ~$0.001 per image
- Processing 1,000 images: ~$1.30
Rule of thumb: use Haiku for classification, OCR, and structured extraction at scale. Use Sonnet when the task requires more nuanced reasoning — complex chart analysis, detailed defect descriptions, comparing multiple images.
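The per-image figures above are plain arithmetic on token counts, which a few lines of code can reproduce. This sketch uses the article's own numbers (about 1,600 tokens for a 1024×1024 image, $3 and $0.80 per million input tokens) and counts image input tokens only; prompt text and output tokens are billed on top. `image_cost_usd` is a hypothetical helper name.

```python
def image_cost_usd(
    tokens_per_image: int,
    price_per_million_usd: float,
    image_count: int = 1,
) -> float:
    """Estimate input cost in USD for `image_count` images of a given token size."""
    return tokens_per_image * image_count * price_per_million_usd / 1_000_000


# The article's figures: ~1,600 tokens for a 1024x1024 image.
sonnet_per_image = image_cost_usd(1600, 3.00)         # 0.0048, i.e. ~$0.005
sonnet_per_1000 = image_cost_usd(1600, 3.00, 1000)    # 4.80, i.e. ~$5
haiku_per_1000 = image_cost_usd(1600, 0.80, 1000)     # 1.28, i.e. ~$1.30
```

Swapping in a different token estimate or price is a one-argument change, which helps when comparing models or image sizes before committing to a large run.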
Claude Vision vs GPT-4o Vision vs Gemini Flash Vision
| Capability | Claude Sonnet 4.6 | GPT-4o | Gemini Flash |
|---|---|---|---|
| Images per request | Up to 100 (API) | Up to 10 | Up to 16 |
| Max image size | 5MB | 20MB | 20MB |
| OCR quality | Excellent | Excellent | Very good |
| Chart and graph reading | Excellent | Good | Good |
| Context window (with images) | 200K tokens | 128K tokens | 1M tokens |
| Cost per 1K images (budget tier) | ~$1.30 (Haiku) | ~$1.50 (4o-mini) | ~$0.40 (Flash) |
| Structured JSON output | Strong | Strong | Good |
Gemini Flash wins on cost and context window. Claude wins on chart reading and complex reasoning tasks. GPT-4o is in the middle on most dimensions. For document processing pipelines where accuracy matters, Claude's edge on OCR and structured extraction usually justifies the cost difference over Gemini.
The multimodal prompting lesson covers the principles behind effective vision prompts in more depth — the same patterns that make text prompts more precise apply equally to image analysis.
Practical patterns worth knowing
Pre-process images when size matters. A multi-megabyte camera original takes much longer to encode and transmit than a compressed JPEG at equivalent visual quality, and RAW formats are not in the supported list, so they need converting anyway. Resize to 2048px on the long edge before sending; you won't lose meaningful visual information for most tasks.
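The target dimensions for that resize are easy to compute while preserving aspect ratio. The sketch below covers only the arithmetic; `fit_long_edge` is a hypothetical helper, and the actual pixel resampling would be done by an imaging library of your choice (for example Pillow's `Image.resize`).

```python
def fit_long_edge(
    width: int, height: int, max_long_edge: int = 2048
) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most `max_long_edge`.

    Images already within the limit are returned unchanged (never upscaled).
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    # Round to whole pixels and keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, a 4096×3072 photo maps to 2048×1536, while an 800×600 screenshot is left as it is.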
Include context the image doesn't show. Claude only knows what's in the image and what you tell it. For invoice processing, add "This invoice is from vendor [NAME] for services in [MONTH]" if that context exists. For damage detection, add "This product was shipped from [LOCATION] and the customer reports damage to the outer packaging."
Ask for structured output by default. Unstructured image descriptions are hard to parse programmatically. JSON with explicit field names is almost always more useful for downstream processing. Define the schema in the prompt as shown in the batch example above.
Validate the output. Vision outputs can have subtle errors — misread numbers, confused units, hallucinated text in areas that are actually blank. For high-stakes applications (financial documents, medical images, legal contracts), add a validation step: ask Claude to review its own extraction against specific fields, or cross-reference with known values.
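Cross-referencing can be as simple as checking that extracted numbers agree with each other. The following sketch assumes a hypothetical extraction shape for an invoice, with a `line_items` list of objects carrying an `amount` and a top-level `total`; neither field name comes from the API, so adapt them to whatever schema your prompt requests.

```python
from decimal import Decimal


def check_invoice_totals(
    extraction: dict, tolerance: Decimal = Decimal("0.01")
) -> list[str]:
    """Return a list of problems found; an empty list means the totals agree."""
    problems: list[str] = []
    # Decimal avoids float rounding noise when summing currency amounts.
    line_total = sum(
        (Decimal(str(item["amount"])) for item in extraction.get("line_items", [])),
        Decimal("0"),
    )
    stated_total = Decimal(str(extraction["total"]))
    if abs(line_total - stated_total) > tolerance:
        problems.append(
            f"line items sum to {line_total}, but total reads {stated_total}"
        )
    return problems
```

A non-empty result is a cheap signal that a digit was misread somewhere, and a good trigger for a second pass or human review.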
For end-to-end document processing pipelines — combining vision with structured extraction and downstream actions — the document processing agent guide covers building production workflows around these same API primitives. The instructor library guide is also useful if you want more robust schema validation than raw JSON parsing.
Claude's 200K context window is particularly valuable for multi-page document workflows: you can send the pages of a document as separate images in a single request, up to the per-request image limit, and ask Claude to reason across the full document, which would otherwise take multiple API calls with smaller-context models.



