Multimodal AI is one of the most practically useful advances in the last two years. You can drop a screenshot of a broken UI into Claude or GPT-4o and get a diagnosis. You can paste a chart and ask what story the data tells. You can upload a scanned document and have it extracted into structured JSON. But vision models have real limits, and if you don't know them, you'll waste time on prompts that were never going to work.
Vision models reward specific, task-oriented prompts even more than text-only models do. The image is rich context; your prompt is the lens that focuses the model's attention on what matters.
What vision models are actually good at
Vision models are surprisingly strong at reading text embedded in images — error messages, code screenshots, UI labels, document headers. They're also solid at:
- Interpreting charts and graphs: trend direction, comparisons, outliers
- Understanding UI layouts: what's clickable, what's broken, what's missing
- Describing diagrams: flowcharts, architecture diagrams, ER diagrams
- Document structure: headings, tables, bullet points from scanned PDFs
- Object recognition and scene description: photos, product images
Where they fall apart:
- Counting precisely: "how many dots are in this image?" will often be wrong
- Fine spatial detail: exact pixel positions, small overlapping elements
- Very small text: anything under ~12pt in a standard screenshot
- Highly compressed or blurry images: JPEG artifacts confuse the model
- Colors in low-contrast environments: dark-on-dark, similar-hue comparisons
Knowing this shapes how you prompt. If you need to count items or pinpoint exact coordinates, do that in code — not with a vision model.
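Counting, for instance, is a few lines of ordinary code once you have a binary mask from thresholding the image. A minimal stdlib-only sketch of counting connected blobs (real pipelines would typically use OpenCV or scipy.ndimage, but the idea is the same):

```python
from collections import deque

def count_blobs(mask):
    """Count connected components (4-connectivity) in a 2D binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])  # flood-fill the rest of this blob
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Three separate blobs in a toy 4x6 mask
mask = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
print(count_blobs(mask))  # 3
```

The code gives you an exact, reproducible answer; a vision model gives you a plausible one.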
Image quality matters more than you think
The model can only work with what you give it. Three things kill vision output quality:
Resolution: A 200×200 thumbnail of a dashboard is almost useless. Aim for screenshots at normal display resolution (1x or 2x retina). When in doubt, bigger is better.
Compression: Heavy JPEG compression creates artifacts that confuse OCR and text reading. If you're sharing a screenshot of code or an error message, use PNG.
Crops: Cropping too tight removes context that helps the model interpret the image. If you're showing a chart, include axis labels. If you're showing a UI bug, include enough surrounding UI that the model understands what it's looking at.
Prompt patterns by image type
Screenshots (UI, errors, dashboards)
The most common mistake with screenshots is asking "what do you see?" That produces a generic description when you want a specific answer. Be precise about what you want.
Bad: "What do you see in this screenshot?"
Good: "This is a screenshot of our checkout page. The 'Place Order' button is not
responding to clicks. What in this UI might explain that behavior? Focus on
any overlapping elements, z-index issues, or disabled states you can see."
For error messages and stack traces in screenshots:
This is a screenshot of a browser console error. Extract the full error message
and stack trace as plain text, then explain the most likely root cause based on
the error type and the file paths shown.
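Once the trace is out of the image and into plain text, the rest is ordinary string processing. A small sketch that pulls file-and-line references out of a browser-style trace (the trace text and its exact format here are assumed for illustration, not real output):

```python
import re

# Hypothetical plain-text trace as returned by the vision extraction step.
trace = """TypeError: Cannot read properties of undefined (reading 'map')
    at renderList (app/components/List.js:42:17)
    at App (app/App.js:15:9)"""

# Matches frames of the form: at funcName (path:line:col)
FRAME = re.compile(r"at (\w+) \(([^:]+):(\d+):(\d+)\)")

frames = [(m.group(2), int(m.group(3))) for m in FRAME.finditer(trace)]
print(frames)  # [('app/components/List.js', 42), ('app/App.js', 15)]
```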
Charts and graphs
Don't ask for the data — ask for the story. Vision models are unreliable at reading precise values off axes, but they're good at interpreting trends and patterns.
Bad: "What are the exact values in this bar chart?"
Good: "This bar chart shows monthly revenue for Q1–Q4 2024. What is the overall
trend, which quarter stands out as an anomaly, and what might explain the dip
in month 8?"
If you do need specific numbers extracted, ask the model to be explicit about its confidence:
Extract the approximate values from this line graph. For each data point,
indicate your confidence (high/medium/low) based on how clearly readable
the value is. Flag anything where the axis scale makes precision difficult.
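If the reply follows the requested format, you can then drop the low-confidence points programmatically and verify them by hand. A sketch assuming a hypothetical reply format of `Month N: ~value (confidence)` lines:

```python
import re

# Hypothetical model output in the format the prompt above requests.
model_output = """\
Month 1: ~100 (high)
Month 2: ~140 (high)
Month 3: ~95 (low) - axis gridlines obscure this point
Month 4: ~160 (medium)"""

LINE = re.compile(r"Month (\d+): ~(\d+) \((high|medium|low)\)")

readings = []
for line in model_output.splitlines():
    m = LINE.match(line)
    if m:
        readings.append((int(m.group(1)), int(m.group(2)), m.group(3)))

# Keep only points the model could read clearly; check the rest against the source.
trusted = [(month, value) for month, value, conf in readings if conf != "low"]
print(trusted)  # [(1, 100), (2, 140), (4, 160)]
```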
Documents and PDFs
Vision models can process document images well when you structure the extraction request:
This is a scanned page from a legal contract. Extract the following in JSON:
- Section headings (as an array)
- Any dates mentioned (ISO format)
- Party names (as "party_a" and "party_b")
- Any monetary amounts mentioned
If a field is not present on this page, return null for that key.
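Because the extraction feeds downstream code, it is worth validating the reply before trusting it. A sketch that checks the keys in the model's JSON (the key names here are assumptions; pin them down explicitly in the prompt, e.g. "return exactly these keys", so validation stays stable):

```python
import json

# Assumed schema -- state it verbatim in your extraction prompt.
EXPECTED_KEYS = {"section_headings", "dates", "party_a", "party_b", "monetary_amounts"}

def validate_extraction(raw: str) -> dict:
    """Parse the model's JSON reply and fail loudly on schema drift."""
    data = json.loads(raw)
    missing = EXPECTED_KEYS - data.keys()
    extra = data.keys() - EXPECTED_KEYS
    if missing or extra:
        raise ValueError(f"schema drift: missing={missing}, extra={extra}")
    return data

reply = ('{"section_headings": ["1. Definitions"], "dates": ["2024-03-01"], '
         '"party_a": "Acme Corp", "party_b": null, "monetary_amounts": null}')
print(validate_extraction(reply)["party_a"])  # Acme Corp
```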
For multi-page documents, process page by page and accumulate:
This is page 3 of a 10-page report. Summarize only the content on this page
in 3-5 bullet points. Focus on new information introduced on this page —
don't repeat context from earlier pages.
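The accumulate loop is simple to sketch. Here `ask_model` stands in for whatever vision client call you use (hypothetical signature, not a real SDK function):

```python
def page_prompt(page_num: int, total_pages: int) -> str:
    """Build the per-page summarization prompt used in the loop below."""
    return (
        f"This is page {page_num} of a {total_pages}-page report. "
        "Summarize only the content on this page in 3-5 bullet points. "
        "Focus on new information introduced on this page - "
        "don't repeat context from earlier pages."
    )

def summarize_report(page_images, ask_model):
    """Accumulate per-page summaries; ask_model(image, prompt) -> str is
    whatever client call you use (hypothetical here)."""
    summaries = []
    for i, image in enumerate(page_images, start=1):
        summaries.append(ask_model(image, page_prompt(i, len(page_images))))
    return "\n\n".join(summaries)
```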
Diagrams and technical drawings
The goal with diagrams is usually to understand relationships, not just enumerate components. Prompt for relationships explicitly:
This is a system architecture diagram. Describe:
1. The main components and their purpose (1 sentence each)
2. How data flows between them (what triggers what)
3. Any single points of failure you can identify
4. What's missing that you'd expect in a production architecture
For ER diagrams or database schemas:
This is an entity-relationship diagram. Identify:
- All entities (table names)
- The primary key for each entity
- All relationships and their cardinality (one-to-many, many-to-many, etc.)
- Any junction/bridge tables
Output this as a structured list, not prose.
Photos
For photos where you need description, context about the use case dramatically improves output:
Bad: "Describe this photo."
Good: "This photo will be used as alt text for an e-commerce product listing
for a visually impaired user. Describe the product shown — its color, material,
shape, size (relative to other objects in frame), and any distinguishing features.
Be specific and factual, not promotional."
Sending images to Claude: technical details
Claude accepts images in two ways via the Messages API: base64-encoded data or public URLs.
Base64 (works offline, no external requests):
{
  "type": "image",
  "source": {
    "type": "base64",
    "media_type": "image/png",
    "data": "<base64-encoded-image-data>"
  }
}
URL (simpler, but the URL must be publicly accessible):
{
  "type": "url",
  "source": {
    "type": "url",
    "url": "https://example.com/screenshot.png"
  }
}
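Building the base64 variant from local image bytes is a few lines of stdlib Python. A sketch (the helper name is ours, not part of any SDK):

```python
import base64

def image_block(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Build a base64 image content block for the Messages API."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(image_bytes).decode("ascii"),
        },
    }

# Usage: image_block(Path("screenshot.png").read_bytes())
block = image_block(b"\x89PNG...")  # placeholder bytes for illustration
print(block["source"]["media_type"])  # image/png
```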
Claude's current limits: up to 20 images per request, max ~5MB per image after base64 encoding. For very large images, resize before sending — anything above 1568px on the longest side gets resized internally anyway, so you're paying token cost without quality gain.
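Given that 1568px figure, you can compute target dimensions before sending. A sketch that preserves aspect ratio and never upscales (you could then do the actual resize with, e.g., Pillow's `Image.resize`):

```python
def fit_within(width: int, height: int, max_side: int = 1568) -> tuple[int, int]:
    """Target dimensions keeping aspect ratio with the longest side <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; don't upscale
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(fit_within(3840, 2160))  # (1568, 882)
```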
Chaining vision with text tasks
The real power of vision models comes when you chain image analysis into a downstream text task. The pattern:
- Step 1: Use the vision model to extract structured information from the image
- Step 2: Feed that extracted information into a text-only prompt for reasoning, formatting, or generation
Example — turning a screenshot into a bug report:
Step 1 prompt (vision):
"This is a screenshot of a web application showing an error state. Extract:
- The exact error message displayed
- The URL shown in the browser bar
- Any form fields visible and their current state
- The timestamp if visible
Output as JSON."
Step 2 prompt (text, fed the JSON from step 1):
"Using this extracted data: [JSON from step 1]
Write a bug report in this format:
**Summary**: One-line description
**Steps to reproduce**: Numbered list
**Expected behavior**: What should happen
**Actual behavior**: What happened instead
**Environment**: Fill in from URL/context clues where possible"
This two-step approach is more reliable than asking the model to do both in one shot, and it lets you inspect the intermediate extraction before using it.
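The two-step flow can be sketched as plain functions. `ask_vision` and `ask_text` stand in for whatever client calls you use (hypothetical signatures); the JSON parse in the middle is the inspection point:

```python
import json

EXTRACTION_PROMPT = (
    "This is a screenshot of a web application showing an error state. Extract:\n"
    "- The exact error message displayed\n"
    "- The URL shown in the browser bar\n"
    "- Any form fields visible and their current state\n"
    "- The timestamp if visible\n"
    "Output as JSON."
)

REPORT_TEMPLATE = """Using this extracted data: {extracted}
Write a bug report in this format:
**Summary**: One-line description
**Steps to reproduce**: Numbered list
**Expected behavior**: What should happen
**Actual behavior**: What happened instead
**Environment**: Fill in from URL/context clues where possible"""

def screenshot_to_bug_report(screenshot, ask_vision, ask_text):
    """Two-step chain; ask_vision(image, prompt) -> str and ask_text(prompt) -> str
    are whatever client calls you use (hypothetical signatures)."""
    extracted = ask_vision(screenshot, EXTRACTION_PROMPT)
    json.loads(extracted)  # validate the intermediate step before reusing it
    return ask_text(REPORT_TEMPLATE.format(extracted=extracted))
```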
Common mistakes
Vague questions on specific images: "What do you think of this design?" gives you a generic response. "What accessibility issues do you see in this form?" gives you something actionable.
Sending the wrong image type for the task: Screenshots of code are fine, but for code review, paste the actual code. Vision models can read code from images, but they're slower and less reliable than working from text directly.
Over-relying on vision for precision: If you need exact numbers from a chart, scrape the underlying data instead. Vision is for interpretation, not measurement.
Ignoring the model's uncertainty: When a vision model hedges ("the text is partially cut off, but appears to be..."), that's information. Don't ignore qualifications — they tell you where to verify.
Too many questions in one prompt: Vision prompts with five separate questions often produce thin answers to each. Ask the most important question first, then follow up.
Vision models have matured fast. Used well — with clear, specific prompts and appropriate image quality — they save hours of manual analysis work. Used lazily, they produce generic descriptions that tell you less than looking at the image yourself.