
Multimodal Prompting: Working with Images, Files, and Mixed Content

Modern AI models can see, read files, and process multiple input types at once. Learn how to structure prompts that work with images, documents, data files, and mixed content effectively.


For most of AI's history, prompting meant typing text and getting text back. That's changed. Modern models like Claude 3 and GPT-4o can process images, PDFs, spreadsheets, and other file types alongside text — sometimes all at once.

This opens up a new set of use cases and a new set of prompting challenges. How you describe what you need is different when you're pointing at an image. What the model can and can't do varies by input type. And combining multiple input types requires thoughtful structure.


What Multimodal Models Can Process

Images:

  • Photographs (products, people, scenes, diagrams)
  • Screenshots of apps, websites, code, documents
  • Charts, graphs, infographics
  • Handwritten notes and whiteboard photos
  • UI/UX mockups and designs

Documents:

  • PDFs (text and image content)
  • Word documents
  • Spreadsheets (Excel, CSV)
  • Presentations

What they can't do with images:

  • Read text that's very small, rotated, or obscured
  • Identify specific private individuals (by design, for privacy)
  • Perform precise pixel-level measurements
  • Access metadata (when a photo was taken, GPS location, etc.)

Prompting for Image Analysis

The biggest mistake with image prompts is being too open-ended: "What do you see in this image?" produces a description, which is rarely what you need.

Start with your purpose:

For screenshots:

Here is a screenshot of our checkout page. I'm trying to reduce cart abandonment.
Identify any UX issues that might cause users to leave before completing purchase — 
confusing elements, missing trust signals, friction points, unclear CTAs.

For charts and graphs:

This is a chart of our monthly active users over the past 12 months.
What's the most important trend or pattern? What would you flag as worth investigating?

For product photos:

This is a photo of our competitor's product packaging. What design choices 
stand out? How does it position the product — premium, value, functional? 
What's working about the design?

For handwritten notes:

This is a photo of whiteboard notes from a strategy session. 
Transcribe all legible text, organized by section. Note anything that appears 
to be an action item (usually circled or underlined in my photos).

The "Show and Tell" Pattern

One of the most powerful multimodal prompting patterns: share an example visually and ask the model to analyze, replicate, or react to it.

Analyze a design:

[Attach image of a landing page]

This is a landing page that converts well for us — about 8% conversion rate. 
Analyze what design and copy elements are likely contributing to that performance.

Replicate a format:

[Attach image of a data visualization]

Create a textual template I could use to recreate this type of visualization 
in a presentation. Describe the structure, what data goes in each element, 
and what the key design choices are.

Iterate on a design:

[Attach image of your current design]

This is our current email newsletter template. The click-through rate is low. 
Without seeing what other newsletters look like, what structural or visual 
changes would you suggest to improve engagement?

Combining Image and Text Context

Sometimes the most useful prompt combines an image with text context the image alone doesn't provide:

[Attach image of a graph]

This graph shows our support ticket volume by day of week over Q4 2025.
Context: We have a scheduled product release every Tuesday. We run marketing 
campaigns primarily on Mondays and Thursdays.

Given this context, explain the pattern in the graph and suggest which days 
to staff our support team more heavily in Q1.

The image provides the visual data. The text context provides what the image can't show — the business context that makes the pattern interpretable.


Working With Data Files (CSVs and Spreadsheets)

When sharing structured data files, always describe the schema:

Here is a CSV of customer transactions from Q4 2025. Columns:
- customer_id: unique identifier
- purchase_date: YYYY-MM-DD format
- product_sku: product identifier
- amount: purchase amount in USD
- channel: acquisition channel (organic, paid_social, email, referral)

[paste CSV or attach file]

I want to understand which acquisition channels produce customers with 
the highest lifetime value. Calculate average total spend per customer 
by acquisition channel.

For attached files (Excel, CSV), include the same schema description even if the column headers are visible — it removes ambiguity and tells the model what the data represents, not just what it's called.
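Because vision models can miscount or misgroup rows, it's worth spot-checking the model's numbers with a few lines of code. Here's a minimal sketch of the "average total spend per customer by acquisition channel" calculation from the prompt above, assuming the example schema (`customer_id`, `amount`, `channel`) and that each customer has a single, consistent acquisition channel:

```python
from collections import defaultdict

def avg_spend_per_customer_by_channel(rows):
    """Average total spend per customer, grouped by acquisition channel.

    rows: iterable of dicts with "customer_id", "amount", "channel" keys
    (e.g. from csv.DictReader). Assumes one channel per customer.
    """
    spend = defaultdict(float)   # customer_id -> total spend
    channel_of = {}              # customer_id -> acquisition channel
    for row in rows:
        spend[row["customer_id"]] += float(row["amount"])
        channel_of[row["customer_id"]] = row["channel"]

    totals = defaultdict(lambda: [0.0, 0])   # channel -> [sum of totals, n customers]
    for cust, total in spend.items():
        agg = totals[channel_of[cust]]
        agg[0] += total
        agg[1] += 1
    return {ch: s / n for ch, (s, n) in totals.items()}

# Usage with a real file: avg_spend_per_customer_by_channel(csv.DictReader(f))
```

If your numbers and the model's disagree, the model's reading of the file — not your arithmetic — is usually the problem.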


Multi-Image Prompts

Some interfaces let you attach multiple images in a single prompt. When doing this:

Label each image clearly:

Image 1 (current homepage):
[attach]

Image 2 (competitor A homepage):
[attach]

Image 3 (competitor B homepage):
[attach]

Compare the visual hierarchy and above-the-fold content across these three pages. 
Which communicates its value proposition most immediately? What is my page missing?

Reference images by label in your question — don't say "the first image" (order can be ambiguous) when you can say "Image 1" or "the homepage screenshot."
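If you're calling a model through an API rather than a chat interface, the same labeling pattern applies: interleave a text block naming each image before the image block itself. Here's a sketch that builds such a payload; the content-block shape follows Anthropic's Messages API (base64 image source), so adapt the `source` field for other providers. No request is sent — this only assembles the content list:

```python
import base64

def labeled_image_content(images, question):
    """Build a labeled multi-image content list for a vision API request.

    images: list of (label, raw_bytes, media_type) tuples.
    Each image is preceded by a text block naming it ("Image 1 (label):"),
    so the question can reference images by label unambiguously.
    """
    blocks = []
    for i, (label, data, media_type) in enumerate(images, start=1):
        blocks.append({"type": "text", "text": f"Image {i} ({label}):"})
        blocks.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.b64encode(data).decode("ascii"),
            },
        })
    blocks.append({"type": "text", "text": question})
    return blocks
```

The labels in the payload and the labels in your question then match exactly, which is the whole point of the pattern.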


PDF Prompting

PDFs work well for text extraction and analysis. Tips:

For text-heavy PDFs:

This is a research paper as a PDF. 
Only analyze the content — don't comment on formatting or layout.

Focus on:
1. The central thesis
2. The methodology (how the study was conducted)
3. The three most important findings
4. Limitations acknowledged by the authors

For mixed PDFs (text + images/charts):

This report contains both text sections and data visualizations (charts and tables).
Analyze both the written content and the visual data.
When you reference a chart or figure, note which one you're referring to.

For contracts and legal documents:

This is an NDA. I'll be sharing confidential product information with a vendor.
Extract and explain in plain language:
- What I am and am not allowed to share
- How long the confidentiality obligation lasts
- What happens if there's a breach
Quote the specific clause for each answer.

Limitations to Know

Models don't have perfect vision. Complex infographics, dense financial tables, and multi-color charts with small text can be misread. For anything critical, verify important numbers against the source.

Context limits apply to images too. Very high-resolution images or many images in a single conversation consume significant context. If you're running long sessions with multiple images, you may hit limits.
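To get a feel for how much context an image consumes, you can estimate it from its dimensions. The sketch below uses the heuristic published in Anthropic's vision documentation — roughly one token per 750 pixels, after large images are downscaled so the longest edge is about 1,568 px. Treat both constants as approximations that vary by provider:

```python
def estimated_image_tokens(width, height, max_edge=1568, px_per_token=750):
    """Rough token cost of an image, per Anthropic's published heuristic.

    Images larger than max_edge on their longest side are assumed to be
    downscaled by the model before processing.
    """
    longest = max(width, height)
    if longest > max_edge:
        scale = max_edge / longest
        width, height = width * scale, height * scale
    return int(width * height / px_per_token)
```

A 1092 × 1092 screenshot comes out to roughly 1,600 tokens — so a conversation with a dozen screenshots spends context quickly, regardless of how little text you type.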

No personal identification. Models are designed not to identify specific individuals in photos. For use cases involving people, you'll need to describe who they are in text.

Models can't write to images. They can analyze, describe, and suggest changes to images — they can't edit the image file itself.


Key Takeaways

  • Purpose-first applies to images too — "what do you see?" is usually the weakest prompt
  • The "show and tell" pattern is powerful: share an example and ask for analysis or replication
  • Combine image input with text context to fill in what the image can't convey
  • For data files, always describe the schema even if column names are visible
  • Verify important information extracted from images — vision isn't perfect

You've now completed the core Intermediate Track techniques. The next lesson brings everything together: using system prompts, context, structured inputs, and advanced control to build reliable, consistent AI workflows. Advanced Track →