Claude Vision API — Complete Guide to Image Analysis and Understanding

Most developers interact with Claude through text. That's leaving a significant capability on the table.

Claude's vision API lets you send images in any messages call — no separate endpoint, no special SDK, no additional setup beyond what you're already using. The same client you use for text completions handles images. You pass an image alongside your text prompt, and Claude reads them together.

This guide covers the Claude Vision API end to end: how to send images (both URL and base64), prompt patterns for the tasks that actually come up in production (OCR, classification, chart reading, defect detection), batch processing with structured output, cost calculations, and a comparison to GPT-4o Vision and Gemini.

What Claude Vision can handle

Before the code: supported formats are JPEG, PNG, GIF, and WebP. The API accepts up to 5MB per image and up to 100 images per request (the 20-image limit applies to the claude.ai app). Claude downscales images that exceed its internal resolution limits, so in most cases you don't need to resize before sending.

Claude Vision works well for:

Extracting text from scanned documents, screenshots, and photos (OCR)
Classifying document types (invoice vs receipt vs contract)
Reading data from charts, graphs, and tables
Describing products or scenes
Comparing before/after images
Analyzing UI screenshots for UX issues
Detecting damage, defects, or anomalies in product photos

What it can't do: generate images. Claude is an analysis model. For generation, you need Imagen, DALL-E, or Stable Diffusion.

Sending images: URL vs base64

There are two ways to pass an image to the Claude Vision API.

Method 1: URL — for publicly accessible images. Faster to write, no encoding overhead:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/product.jpg"
                }
            },
            {
                "type": "text",
                "text": "What product is this? Extract name, color, and any visible price."
            }
        ]
    }]
)

print(response.content[0].text)

Method 2: Base64 — for local files, private images, or anything that isn't publicly accessible:

import anthropic, base64
from pathlib import Path

client = anthropic.Anthropic()

image_data = base64.standard_b64encode(Path("invoice.png").read_bytes()).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {
                "type": "text",
                "text": "Extract all text from this invoice."
            }
        ]
    }]
)

Use image/jpeg for JPG files, image/png for PNG, image/gif for GIF, and image/webp for WebP. The media_type must match the actual file format — mismatches cause silent parsing failures.
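Because the media_type has to match the real file format, it can be safer to derive it from the file's leading bytes than from its extension. The following is a minimal sketch under that assumption; the function names `sniff_media_type` and `media_type_for` are hypothetical helpers, not part of the Anthropic SDK.

```python
# Leading "magic" bytes for the formats listed above.
_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_media_type(data: bytes) -> str:
    """Return the media type implied by the first bytes of an image file."""
    for signature, media_type in _SIGNATURES:
        if data.startswith(signature):
            return media_type
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unsupported or unrecognized image format")


def media_type_for(path: str) -> str:
    """Hypothetical convenience wrapper: sniff a file on disk."""
    with open(path, "rb") as f:
        return sniff_media_type(f.read(12))
```

Calling `media_type_for("invoice.png")` before building the request would catch a JPEG that was saved with a .png extension, which an extension-based lookup would miss.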

Prompt patterns by task

The image is only half the input. The prompt matters just as much as it does for text-only requests. Here are the patterns that work for the most common vision tasks.

OCR and text extraction

Extract all text from this image exactly as it appears, preserving the original formatting. 
If you see a table, output it as markdown with | separators. 
If there are multiple sections (header, body, footer), label them clearly.
If any text is partially obscured or unclear, include your best reading followed by [?].

The explicit formatting instruction prevents Claude from summarizing instead of transcribing — a common failure mode when the prompt is just "extract text."

Document classification

Classify this document as one of: invoice, receipt, purchase order, contract, bank statement, 
form, report, or other.

Return valid JSON in this format:
{
  "type": "invoice",
  "confidence": "high",
  "key_identifiers": ["invoice number visible", "line items with prices", "due date present"]
}

Use "high", "medium", or "low" for confidence based on how clearly the document matches the type.

Asking for key_identifiers forces Claude to ground its classification in specific visual evidence rather than guessing.
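A reply to the classification prompt still has to be parsed and checked before it feeds a pipeline. The sketch below assumes the reply may wrap the JSON in extra prose or a code fence, so it parses the span from the first "{" to the last "}". The names `extract_json_object` and `parse_classification` are illustrative, and the allowed values are copied from the prompt above.

```python
import json

ALLOWED_TYPES = {
    "invoice", "receipt", "purchase order", "contract",
    "bank statement", "form", "report", "other",
}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def extract_json_object(reply: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


def parse_classification(reply: str) -> dict:
    """Validate a classification reply against the values the prompt allows."""
    data = extract_json_object(reply)
    if data.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {data.get('type')!r}")
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {data.get('confidence')!r}")
    identifiers = data.get("key_identifiers")
    if not isinstance(identifiers, list) or not identifiers:
        raise ValueError("key_identifiers must be a non-empty list")
    return data
```

Rejecting a reply with an empty key_identifiers list keeps the "ground it in visual evidence" requirement enforceable in code rather than only in the prompt.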

Product cataloging

Analyze this product image and extract details in JSON format:
{
  "name": "product name or best description",
  "brand": "visible brand or null",
  "color": "primary color(s)",
  "size_visible": "any size information visible on packaging",
  "price_visible": "price if shown, else null",
  "condition": "new, used, or unclear"
}

If a field isn't visible or determinable, use null — don't guess.

Chart and graph reading

Describe the data in this chart precisely. Extract:
- Chart type (bar, line, pie, scatter, etc.)
- X and Y axis labels and units
- All data series names and their colors/patterns
- Approximate values at: highest point, lowest point, most recent point (if time series)
- The main trend or insight the chart communicates

If values are approximate, say so. Don't round unless the chart rounds.

This prompt works on screenshots of Excel charts, embedded analytics dashboards, and published data visualizations. Claude handles axis reading surprisingly well even on cluttered charts.

Damage and defect detection

Inspect this image for damage, defects, or quality issues.

For each issue found, provide:
- Location: describe in plain English (e.g., "top-right corner", "center of the surface")  
- Type: what kind of damage or defect
- Severity: minor (cosmetic only), moderate (functional impact possible), severe (clearly defective)
- Recommended action: accept, flag for review, or reject

If no issues are found, say "No defects detected" with a brief description of what you examined.

UI screenshot analysis

This is a screenshot of a web page or app UI. Analyze it and identify:
1. Main call-to-action: what action is the page primarily asking users to take?
2. Navigation: list the main navigation items visible
3. Error states: any visible error messages, broken elements, or missing images
4. UX issues: anything that looks confusing, inaccessible, or inconsistent with good design
5. Content: what is the page primarily about?

This is useful for automated visual QA — run it against screenshots from your test suite to catch visual regressions that unit tests miss.

Multi-image comparison

Sending multiple images in one request lets Claude reason across them — before/after comparisons, product variants, document versions:

before_b64 = base64.standard_b64encode(Path("before.jpg").read_bytes()).decode()
after_b64 = base64.standard_b64encode(Path("after.jpg").read_bytes()).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Image 1 (before renovation):"},
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/jpeg", "data": before_b64}
            },
            {"type": "text", "text": "Image 2 (after renovation):"},
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/jpeg", "data": after_b64}
            },
            {
                "type": "text",
                "text": "What specifically changed between these two images? List each change you can identify."
            }
        ]
    }]
)

Label each image explicitly in the text. When Claude processes multiple images, labeling ("Image 1:", "Image 2:") makes references in the response unambiguous. Without labels, "the image on the left" doesn't mean anything in an API response.
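The labeling convention can be applied mechanically when the number of images varies. This is a small sketch, not an SDK feature: `labeled_image_blocks` is a hypothetical helper that emits a text label before each image block and puts the question last, mirroring the content structure in the request above.

```python
import base64


def labeled_image_blocks(
    images: list[tuple[str, bytes, str]], question: str
) -> list[dict]:
    """Build a content list with a text label before each image.

    Each entry in `images` is (label, raw_bytes, media_type).
    """
    content = []
    for index, (label, raw, media_type) in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index} ({label}):"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(raw).decode(),
            },
        })
    # The question goes last so it can refer to the labels above it.
    content.append({"type": "text", "text": question})
    return content
```

The returned list can be passed as the "content" value of a user message, and the question can then refer to "Image 1" and "Image 2" by name.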

Batch processing with structured output

For processing large volumes of images — product catalogs, document archives, screenshot libraries — use Claude Haiku instead of Sonnet. Same API, significantly lower cost:

import anthropic, base64, json
from pathlib import Path
from pydantic import BaseModel

client = anthropic.Anthropic()

class ProductInfo(BaseModel):
    name: str
    category: str
    has_price_tag: bool
    dominant_colors: list[str]
    quality_issues: list[str]

def analyze_product_image(image_path: str) -> ProductInfo:
    path = Path(image_path)
    image_data = base64.standard_b64encode(path.read_bytes()).decode()
    
    ext = path.suffix.lower().lstrip(".")
    media_type = f"image/{'jpeg' if ext == 'jpg' else ext}"
    
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku for cost-efficient batch processing
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": media_type, "data": image_data}
                },
                {
                    "type": "text",
                    "text": f"""Analyze this product image. Return valid JSON matching this schema:
{json.dumps(ProductInfo.model_json_schema(), indent=2)}

For quality_issues, list any visible defects, damage, or quality problems. 
Empty list if none found."""
                }
            ]
        }]
    )
    
    return ProductInfo.model_validate_json(response.content[0].text)

# Process a directory of product images
def batch_analyze(image_dir: str) -> list[dict]:
    results = []
    for image_path in Path(image_dir).glob("*.jpg"):
        try:
            info = analyze_product_image(str(image_path))
            results.append({"file": image_path.name, **info.model_dump()})
        except Exception as e:
            results.append({"file": image_path.name, "error": str(e)})
    return results

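For very large batches (thousands of images), look at Anthropic's Message Batches API, which processes requests asynchronously at 50% lower cost. The trade-off is latency: results arrive when the batch finishes rather than per request.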
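For the asynchronous route, each image becomes one entry in a list of requests, each with its own identifier and the same parameters a normal messages call would take. The sketch below builds that list; the `img-00000` identifier scheme and the helper names are assumptions made for illustration. `submit_batch` shows the SDK call as I understand it (`client.messages.batches.create`) and is worth checking against the current SDK documentation before relying on it.

```python
import base64


def build_batch_requests(
    images: list[tuple[bytes, str]],
    prompt: str,
    model: str,
    max_tokens: int = 512,
) -> list[dict]:
    """Turn (raw_bytes, media_type) pairs into batch request entries."""
    requests = []
    for index, (raw, media_type) in enumerate(images):
        requests.append({
            # Hypothetical id scheme; keep your own map from id to file name.
            "custom_id": f"img-{index:05d}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": media_type,
                                "data": base64.standard_b64encode(raw).decode(),
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }],
            },
        })
    return requests


def submit_batch(requests: list[dict]):
    """Submit the batch; needs the anthropic package and an API key."""
    import anthropic  # imported here so the builder works without the SDK

    client = anthropic.Anthropic()
    return client.messages.batches.create(requests=requests)
```

Keeping request construction separate from submission makes the payload easy to inspect and unit test before any tokens are spent.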

Understanding vision costs

Images are billed as tokens. The token count depends on image dimensions:

A 1024×1024 image ≈ 1,600 input tokens
A 512×512 image ≈ 400 input tokens
A 2048×2048 image ≈ 6,400 input tokens

At Claude Sonnet 4.6 pricing ($3/1M input tokens):

1024×1024 image: ~$0.005 per image
Processing 1,000 images: ~$5

At Claude Haiku pricing ($0.80/1M input tokens):

1024×1024 image: ~$0.001 per image
Processing 1,000 images: ~$1.30
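The arithmetic above can be wrapped in a small estimator. This sketch assumes token count scales linearly with pixel area, calibrated to the 1024×1024 ≈ 1,600 tokens figure quoted here. That matches the three sizes listed, but it is an extrapolation, not Anthropic's published formula. It also counts image tokens only; prompt text and output tokens are billed on top.

```python
# Calibration point taken from the figures above (an approximation).
REFERENCE_PIXELS = 1024 * 1024
REFERENCE_TOKENS = 1600


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate input tokens for one image, assuming area-proportional cost."""
    return round(REFERENCE_TOKENS * (width * height) / REFERENCE_PIXELS)


def estimate_cost_usd(
    width: int, height: int, image_count: int, usd_per_million_tokens: float
) -> float:
    """Estimate the image-token cost of a batch of same-sized images."""
    tokens = estimate_image_tokens(width, height) * image_count
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    for label, price in [("Sonnet", 3.00), ("Haiku", 0.80)]:
        cost = estimate_cost_usd(1024, 1024, 1000, price)
        print(f"{label}: ${cost:.2f} per 1,000 images")
```

At the prices quoted above this gives about $4.80 and $1.28 per 1,000 images, which the text rounds to ~$5 and ~$1.30.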

Rule of thumb: use Haiku for classification, OCR, and structured extraction at scale. Use Sonnet when the task requires more nuanced reasoning — complex chart analysis, detailed defect descriptions, comparing multiple images.

Claude Vision vs GPT-4o Vision vs Gemini Flash Vision

Capability	Claude Sonnet 4.6	GPT-4o	Gemini Flash
Images per request	Up to 100	Up to 10	Up to 16
Max image size	5MB	20MB	20MB
OCR quality	Excellent	Excellent	Very good
Chart and graph reading	Excellent	Good	Good
Context window (with images)	200K tokens	128K tokens	1M tokens
Cost per 1K images (budget tier)	~$1.30 (Haiku)	~$1.50 (4o-mini)	~$0.40 (Flash)
Structured JSON output	Strong	Strong	Good

Gemini Flash wins on cost and context window. Claude wins on chart reading and complex reasoning tasks. GPT-4o is in the middle on most dimensions. For document processing pipelines where accuracy matters, Claude's edge on OCR and structured extraction usually justifies the cost difference over Gemini.

The multimodal prompting lesson covers the principles behind effective vision prompts in more depth — the same patterns that make text prompts more precise apply equally to image analysis.

Practical patterns worth knowing

Pre-process images when size matters. A 20MB RAW camera file takes much longer to encode and transmit than a compressed JPEG at equivalent visual quality. Resize to 2048px on the long edge before sending — you won't lose meaningful visual information for most tasks.
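The resize step reduces to computing target dimensions that cap the long edge while preserving aspect ratio. Here is a minimal sketch of that calculation; `fit_long_edge` is a hypothetical helper, and the 2048px default simply mirrors the suggestion above.

```python
def fit_long_edge(
    width: int, height: int, max_long_edge: int = 2048
) -> tuple[int, int]:
    """Return dimensions scaled so the longer side is at most max_long_edge.

    Images already within the limit are returned unchanged (no upscaling).
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    # Guard against a zero-sized side on extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting tuple can be handed to an imaging library's resize call, for example Pillow's `Image.resize`, before re-encoding the image as JPEG.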

Include context the image doesn't show. Claude only knows what's in the image and what you tell it. For invoice processing, add "This invoice is from vendor [NAME] for services in [MONTH]" if that context exists. For damage detection, add "This product was shipped from [LOCATION] and the customer reports damage to the outer packaging."

Ask for structured output by default. Unstructured image descriptions are hard to parse programmatically. JSON with explicit field names is almost always more useful for downstream processing. Define the schema in the prompt as shown in the batch example above.

Validate the output. Vision outputs can have subtle errors — misread numbers, confused units, hallucinated text in areas that are actually blank. For high-stakes applications (financial documents, medical images, legal contracts), add a validation step: ask Claude to review its own extraction against specific fields, or cross-reference with known values.
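One inexpensive form of cross-referencing is an internal consistency check on the extracted data itself. The sketch below assumes an invoice extraction shaped like `{"line_items": [{"amount": ...}], "total": ...}`. That schema and the function name are hypothetical, chosen only to illustrate the idea of catching a misread digit by comparing the line-item sum with the stated total.

```python
from decimal import Decimal, InvalidOperation


def check_invoice_totals(
    extracted: dict, tolerance: Decimal = Decimal("0.01")
) -> list[str]:
    """Return a list of problems; an empty list means the totals agree."""
    try:
        line_sum = sum(
            (Decimal(str(item["amount"])) for item in extracted["line_items"]),
            Decimal("0"),
        )
        total = Decimal(str(extracted["total"]))
    except (KeyError, TypeError, InvalidOperation) as exc:
        return [f"missing or non-numeric field: {exc!r}"]

    problems = []
    if abs(line_sum - total) > tolerance:
        problems.append(f"line items sum to {line_sum}, but total reads {total}")
    return problems
```

Any non-empty result is a signal to route the document to a second extraction pass or to human review rather than accepting it automatically.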

For end-to-end document processing pipelines — combining vision with structured extraction and downstream actions — the document processing agent guide covers building production workflows around these same API primitives. The instructor library guide is also useful if you want more robust schema validation than raw JSON parsing.

Claude's 200K context window is particularly valuable for multi-page document workflows: you can send up to 20 images representing different pages of the same document and ask Claude to reason across the full document — something that requires multiple API calls with smaller-context models.
