What is prompt engineering?

Prompt engineering is the practice of crafting inputs to AI language models to produce accurate, useful, and reliable outputs. It involves choosing the right words, structure, context, and format to guide the AI toward the response you actually need — rather than a generic or off-target one.

Which AI models benefit most from better prompting?

All major large language models — including ChatGPT (GPT-4o), Claude, and Gemini — respond significantly to prompt quality. The same task can produce dramatically different results depending on how you structure your request. Better prompting improves output across every major model.

Do I need technical skills to do prompt engineering?

No. Prompt engineering is done in natural language — you write text instructions, not code. Basic prompting needs no technical background at all. Advanced techniques like prompt chaining or agentic workflows can benefit from light scripting knowledge, but the core skill is clear written communication.

Where can I learn more about prompt engineering?

MasterPrompting.net offers a structured curriculum from beginner to advanced, covering every major technique from basic clarity and context to chain-of-thought, meta-prompting, and agentic workflows. Start with the Beginner track to build a solid foundation.

AI document processing agent — extract structured data from PDFs and invoices

A client of mine processes 300 vendor invoices a month. Manual data entry was taking 2 hours a day — one person, copy-pasting vendor names, GSTIN numbers, line items, and totals into Tally. After building this agent, the same 300 invoices process in 15 minutes, with human review only for the ~8% the agent flags as low-confidence.

The extraction cost: ₹0.50–2 per invoice. Manual entry cost: ₹15–50 per invoice. The math is obvious.

This post builds the complete pipeline: text PDF extraction with pdfplumber, OCR for scanned documents, Claude-based structured extraction with confidence scoring, and overnight batch processing for large volumes.

The extraction stack

Three tools for three scenarios:

pdfplumber — for text-based PDFs (computer-generated invoices, bank statements). Fast, free, no API calls needed for extraction.
pytesseract + pdf2image — for scanned PDFs (photographed receipts, photocopied invoices). OCR path.
Claude's vision API — fallback for complex layouts, tables, or mixed content that text extraction gets wrong.

pip install pdfplumber pytesseract pdf2image anthropic pydantic
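# Also install Tesseract OCR: brew install tesseract (macOS) or apt install tesseract-ocr (Linux)
# pdf2image needs Poppler: brew install poppler (macOS) or apt install poppler-utils (Linux)
# The OCR step uses lang="eng+hin", which needs Hindi language data: brew install tesseract-lang (macOS) or apt install tesseract-ocr-hin (Linux)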

The extraction agent

Start with a Pydantic model that defines exactly what you want out of each invoice:

from pydantic import BaseModel, Field
from typing import Optional
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
import anthropic
import json

class LineItem(BaseModel):
    description: str
    hsn_code: Optional[str] = None
    quantity: float
    rate: float
    amount: float

class InvoiceData(BaseModel):
    vendor_name: str
    vendor_gstin: Optional[str] = Field(None, description="15-character GSTIN")
    buyer_gstin: Optional[str] = None
    invoice_number: str
    invoice_date: str = Field(description="DD/MM/YYYY or YYYY-MM-DD")
    line_items: list[LineItem]
    taxable_amount: float
    cgst: float = 0.0
    sgst: float = 0.0
    igst: float = 0.0
    total: float
    currency: str = "INR"

class ExtractionResult(BaseModel):
    data: Optional[InvoiceData]
    confidence: int = Field(ge=1, le=5, description="Overall extraction confidence 1-5")
    low_confidence_fields: list[str] = Field(default_factory=list, description="Fields the model is uncertain about")
    extraction_method: str
    raw_text_preview: str

Text PDF extraction

client = anthropic.Anthropic()

def extract_text_from_pdf(pdf_path: str) -> tuple[str, str]:
    """Returns (extracted_text, method_used)."""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages_text = []
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages_text.append(text)
            
            if pages_text:
                return "\n\n".join(pages_text), "pdfplumber"
    except Exception:
        pass
    
    # Fall back to OCR
    return ocr_pdf(pdf_path), "tesseract"

def ocr_pdf(pdf_path: str) -> str:
    """OCR a PDF using pytesseract."""
    images = convert_from_path(pdf_path, dpi=300)
    texts = []
    for image in images:
        text = pytesseract.image_to_string(image, lang="eng+hin")  # English + Hindi
        texts.append(text)
    return "\n\n".join(texts)

def extract_invoice(pdf_path: str) -> ExtractionResult:
    raw_text, method = extract_text_from_pdf(pdf_path)
    
    if not raw_text.strip():
        return ExtractionResult(
            data=None,
            confidence=1,
            low_confidence_fields=["all"],
            extraction_method="failed",
            raw_text_preview="",
        )
    
    schema = InvoiceData.model_json_schema()
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""Extract invoice data from this text. Return a JSON object with three fields:
1. "data": the invoice fields matching this schema: {json.dumps(schema)}
2. "confidence": integer 1–5 (5 = completely confident in all fields)
3. "low_confidence_fields": list of field names you're uncertain about (empty list if confident in all)

Rules:
- GSTIN format: 15 characters, starts with 2-digit state code
- If a field is not present in the document, use null
- Amounts are in INR as floats
- If GST type is unclear (interstate vs intrastate), check if IGST or CGST+SGST is mentioned
- Return only the JSON object, no other text

Invoice text:
{raw_text[:6000]}"""
        }],
    )
    
    try:
        result_json = json.loads(response.content[0].text)
        invoice_data = InvoiceData.model_validate(result_json.get("data", {}))
        
        return ExtractionResult(
            data=invoice_data,
            confidence=result_json.get("confidence", 3),
            low_confidence_fields=result_json.get("low_confidence_fields", []),
            extraction_method=method,
            raw_text_preview=raw_text[:500],
        )
    except Exception:  # covers malformed JSON as well as schema validation failures
        return ExtractionResult(
            data=None,
            confidence=1,
            low_confidence_fields=["all"],
            extraction_method=method,
            raw_text_preview=raw_text[:500],
        )
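
The prompt asks for "only the JSON object", but models sometimes wrap JSON in a Markdown code fence anyway, and json.loads then fails on the first character. Here is a minimal sketch of a more tolerant parser, assuming the only wrapper to handle is a single fence around the whole reply. The name parse_model_json is hypothetical and is not part of the Anthropic SDK.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the listing readable


def parse_model_json(raw: str) -> dict:
    """Parse a JSON object from model output, tolerating a Markdown fence around it."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if it is present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

In extract_invoice, parse_model_json(response.content[0].text) could stand in for the direct json.loads call. Any ValueError it raises is still caught by the existing except block.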

Vision API fallback for complex layouts

Some invoices have complex table layouts that pdfplumber misses. For these, pass the PDF page as an image directly to Claude's vision API:

import base64
import io

def extract_with_vision(pdf_path: str) -> ExtractionResult:
    """Use Claude's vision API for complex layout PDFs."""
    # Only the first page is rendered; later pages of a multi-page invoice are not sent
    images = convert_from_path(pdf_path, dpi=200, first_page=1, last_page=1)
    if not images:
        return ExtractionResult(data=None, confidence=1, low_confidence_fields=["all"], extraction_method="vision_failed", raw_text_preview="")
    
    # Convert PIL image to base64
    buffer = io.BytesIO()
    images[0].save(buffer, format="PNG")
    image_data = base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
    
    schema = InvoiceData.model_json_schema()
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
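                {
                    "type": "text",
                    "text": f"""Extract invoice data from this image. Return a JSON object with three fields:
1. "data": the invoice fields matching this schema: {json.dumps(schema)}
2. "confidence": integer 1–5 (5 = completely confident in all fields)
3. "low_confidence_fields": list of field names you're uncertain about (empty list if confident in all)

Return only the JSON object, no other text.""",
                }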
            ],
        }],
    )
    
    try:
        result_json = json.loads(response.content[0].text)
        invoice_data = InvoiceData.model_validate(result_json.get("data", {}))
        return ExtractionResult(
            data=invoice_data,
            confidence=result_json.get("confidence", 3),
            low_confidence_fields=result_json.get("low_confidence_fields", []),
            extraction_method="vision",
            raw_text_preview="(vision extraction — no raw text)",
        )
    except Exception:
        return ExtractionResult(data=None, confidence=1, low_confidence_fields=["all"], extraction_method="vision_failed", raw_text_preview="")

def smart_extract_invoice(pdf_path: str) -> ExtractionResult:
    """Try text extraction first; fall back to vision if confidence is low."""
    result = extract_invoice(pdf_path)
    
    if result.confidence < 3 or result.data is None:
        vision_result = extract_with_vision(pdf_path)
        if vision_result.confidence > result.confidence:
            return vision_result
    
    return result

Processing a batch of invoices

For monthly invoice runs (100–500 invoices), use the Anthropic Batch API for 50% off token costs:

def batch_extract_invoices(invoice_paths: list[str]) -> list[dict]:
    """Process invoices overnight using the Batch API."""
    # First, extract text from all PDFs
    extracted_texts = {}
    for path in invoice_paths:
        text, method = extract_text_from_pdf(path)
        extracted_texts[path] = {"text": text, "method": method}
    
    schema = InvoiceData.model_json_schema()
    
    # Build batch requests. custom_id is restricted in length and character set,
    # which file paths can violate, so use short generated IDs and map them back.
    batch_requests = []
    id_to_path = {}
    for index, (path, data) in enumerate(extracted_texts.items()):
        if not data["text"].strip():
            continue
        
        custom_id = f"invoice-{index}"
        id_to_path[custom_id] = path
        batch_requests.append({
            "custom_id": custom_id,
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1500,
                "messages": [{
                    "role": "user",
                    # Ask for the same three top-level fields the parser below reads
                    "content": (
                        "Extract invoice data. Return only a JSON object with three fields: "
                        f'"data" (matching this schema: {json.dumps(schema)}), '
                        '"confidence" (integer 1-5), and "low_confidence_fields" (list of field names).'
                        f"\n\n{data['text'][:6000]}"
                    ),
                }],
            }
        })
    
    # Submit batch
    batch = client.messages.batches.create(requests=batch_requests)
    print(f"Batch submitted: {batch.id} ({len(batch_requests)} invoices)")
    
    # Poll for results (typically 15–60 minutes)
    import time
    while True:
        status = client.messages.batches.retrieve(batch.id)
        print(f"Status: {status.processing_status}, counts: {status.request_counts}")
        if status.processing_status == "ended":
            break
        time.sleep(60)
    
    # Collect results
    results = []
    for result in client.messages.batches.results(batch.id):
        path = id_to_path[result.custom_id]
        if result.result.type != "succeeded":
            # Errored, canceled, or expired requests should not disappear silently
            results.append({"path": path, "error": result.result.type})
            continue
        try:
            parsed = json.loads(result.result.message.content[0].text)
            results.append({
                "path": path,
                "data": InvoiceData.model_validate(parsed.get("data", {})),
                "confidence": parsed.get("confidence", 3),
                "low_confidence_fields": parsed.get("low_confidence_fields", []),
            })
        except Exception as e:
            results.append({"path": path, "error": str(e)})
    
    return results

See the Batch API guide for polling patterns and error handling.
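
One polling pattern worth having is a deadline, so a stuck batch cannot keep the script looping forever. Here is a minimal sketch, assuming the status is available as a plain string and that "ended" is the terminal value, as in the loop above. The helper name wait_until_ended and the 24-hour default timeout are my assumptions, not SDK features.

```python
import time
from typing import Callable


def wait_until_ended(
    get_status: Callable[[], str],
    interval_s: float = 60.0,
    timeout_s: float = 24 * 60 * 60,  # assumed upper bound; adjust to your own limits
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "ended"; return False if the deadline passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "ended":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

With the client from this post, the call would look like wait_until_ended(lambda: client.messages.batches.retrieve(batch.id).processing_status). The sleep parameter exists so the loop can be tested without waiting.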

India-specific extraction notes

GSTIN validation: A valid GSTIN is exactly 15 characters. The first two are the state code (07 = Delhi, 27 = Maharashtra, 29 = Karnataka, etc.). Add this to your post-extraction validation:

import re

def validate_gstin(gstin: str | None) -> bool:
    if not gstin:
        return True  # Optional field
    pattern = r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$"
    return bool(re.match(pattern, gstin))
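
The regex checks the shape of a GSTIN but not its last character, which is a check digit computed from the first 14. OCR errors often produce strings that match the pattern but fail the checksum, so it is a useful second filter. The sketch below follows the check-digit scheme as it is commonly described (base-36 values, alternating weights of 1 and 2). Treat it as an assumption to verify against official documentation before relying on it. The function names are hypothetical.

```python
GSTIN_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def gstin_check_digit(first_14: str) -> str:
    """Compute the 15th character from the first 14 characters of a GSTIN."""
    if len(first_14) != 14:
        raise ValueError("expected exactly 14 characters")
    total = 0
    for index, char in enumerate(first_14):
        value = GSTIN_ALPHABET.index(char)  # raises ValueError on invalid characters
        product = value * (2 if index % 2 else 1)  # weights alternate 1, 2, 1, 2, ...
        total += product // 36 + product % 36
    return GSTIN_ALPHABET[(36 - total % 36) % 36]


def has_valid_check_digit(gstin: str) -> bool:
    """True when the final character matches the computed check digit."""
    gstin = gstin.strip().upper()
    if len(gstin) != 15:
        return False
    try:
        return gstin_check_digit(gstin[:14]) == gstin[14]
    except ValueError:
        return False
```

For the sample value 27AAPFU0939F1ZV, which appears widely in GSTIN validator examples, gstin_check_digit("27AAPFU0939F1Z") returns "V". A failed checksum is a good reason to add vendor_gstin to the fields sent for review.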

HSN codes: Most B2B GST invoices include 4–8 digit HSN codes per line item. Extract these — they're required for input tax credit claims.

CGST/SGST vs IGST: Intrastate transactions use CGST + SGST (equal split). Interstate transactions use IGST (full rate). If the vendor and buyer GSTINs start with the same two digits, it's intrastate.
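
The same-state rule can be turned into a cross-check on the extracted tax amounts. This is a rough sketch that ignores any special cases. The helper names and the ₹0.01 tolerance are assumptions for illustration.

```python
from typing import Optional


def expected_gst_type(vendor_gstin: Optional[str], buyer_gstin: Optional[str]) -> Optional[str]:
    """Return "intrastate", "interstate", or None when either GSTIN is missing."""
    if not vendor_gstin or not buyer_gstin:
        return None
    return "intrastate" if vendor_gstin[:2] == buyer_gstin[:2] else "interstate"


def gst_split_issues(
    vendor_gstin: Optional[str],
    buyer_gstin: Optional[str],
    cgst: float,
    sgst: float,
    igst: float,
    tolerance: float = 0.01,
) -> list[str]:
    """List tax fields that disagree with what the two GSTINs imply."""
    issues = []
    kind = expected_gst_type(vendor_gstin, buyer_gstin)
    if kind == "intrastate":
        if igst > tolerance:
            issues.append("igst")  # IGST should be zero within a state
        if abs(cgst - sgst) > tolerance:
            issues.extend(["cgst", "sgst"])  # CGST and SGST should be an equal split
    elif kind == "interstate":
        if cgst > tolerance:
            issues.append("cgst")
        if sgst > tolerance:
            issues.append("sgst")
    return issues
```

An empty list means the tax split is consistent with the state codes. Anything else can be merged into low_confidence_fields so the invoice goes to a human.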

Form 16: If you're processing salary documents for HR, Form 16 has a different structure — employer TAN instead of GSTIN, salary components instead of line items. Build a separate schema for it.

Routing low-confidence extractions for human review

CONFIDENCE_THRESHOLD = 3  # Below this, or with any uncertain field: flag for human review

def process_invoice_batch(invoice_paths: list[str]) -> dict:
    results = {"auto_processed": [], "needs_review": [], "failed": []}
    
    for path in invoice_paths:
        result = smart_extract_invoice(path)
        
        if result.data is None:
            results["failed"].append({"path": path, "reason": "extraction_failed"})
        elif result.confidence < CONFIDENCE_THRESHOLD or result.low_confidence_fields:
            results["needs_review"].append({
                "path": path,
                "data": result.data,
                "confidence": result.confidence,
                "uncertain_fields": result.low_confidence_fields,
            })
        else:
            results["auto_processed"].append({
                "path": path,
                "data": result.data,
            })
    
    # Count uncertain fields across the review queue (results has no 'low_confidence_fields' key)
    uncertain_count = sum(len(item["uncertain_fields"]) for item in results["needs_review"])
    print(f"Auto-processed: {len(results['auto_processed'])}")
    print(f"Needs review: {len(results['needs_review'])} ({uncertain_count} uncertain fields)")
    print(f"Failed: {len(results['failed'])}")
    
    return results

In production, the "needs review" items go to a simple web UI where a human can confirm or correct the extracted values before they're written to Tally/Zoho Books.
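
Model-reported confidence can be backed up with a deterministic arithmetic check before anything is auto-processed. The sketch below assumes line item amounts are pre-tax and that the total equals taxable amount plus taxes, with no separate discount or round-off lines. Invoices that break those assumptions will be flagged, which is the safe direction. The function name and the ₹1 tolerance are hypothetical.

```python
def total_mismatches(
    taxable_amount: float,
    cgst: float,
    sgst: float,
    igst: float,
    total: float,
    line_amounts: list[float],
    tolerance: float = 1.0,  # allow for rounding to the nearest rupee
) -> list[str]:
    """Return field names whose arithmetic does not add up within the tolerance."""
    flagged = []
    # Line items should sum to the taxable amount.
    if abs(sum(line_amounts) - taxable_amount) > tolerance:
        flagged.append("taxable_amount")
    # Taxable amount plus all taxes should equal the invoice total.
    if abs(taxable_amount + cgst + sgst + igst - total) > tolerance:
        flagged.append("total")
    return flagged
```

Any field names it returns can be appended to low_confidence_fields before routing, so an invoice with a confident but inconsistent extraction still lands in the review queue.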

Cost breakdown

Invoice type	Method	Cost per invoice
Standard digital PDF	pdfplumber + Sonnet	~₹0.50
Complex layout	pdfplumber + vision	~₹1.50
Scanned/photographed	OCR + Sonnet	~₹1.50
Batch processing (overnight)	Batch API	50% of above

For 300 invoices/month with mixed types: ~₹300–450/month, compared with ₹4,500–15,000 for manual entry at ₹15–50 each.
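
To rerun the comparison for your own volumes, the arithmetic is short. The per-invoice costs come from the table above. The monthly mix of 100 invoices per type is a made-up example, not the client's actual distribution.

```python
# Per-invoice extraction costs from the table above (INR).
COST_PER_INVOICE = {"standard": 0.50, "complex": 1.50, "scanned": 1.50}

# Hypothetical monthly mix; replace with your own counts.
monthly_mix = {"standard": 100, "complex": 100, "scanned": 100}

automated = sum(COST_PER_INVOICE[kind] * count for kind, count in monthly_mix.items())
invoices = sum(monthly_mix.values())
manual_low, manual_high = invoices * 15, invoices * 50  # manual entry at INR 15 to 50 each

print(f"Automated: INR {automated:.0f}/month")
print(f"Manual: INR {manual_low:,} to {manual_high:,}/month")
```

With this mix the automated cost comes to ₹350 a month, inside the ₹300–450 range quoted above, against ₹4,500–15,000 for manual entry.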

The Pydantic AI post covers how to restructure this as a Pydantic AI agent if you want typed dependency injection and easier testing. The structured outputs post has more patterns for getting reliable JSON from LLM responses.
