A client of mine processes 300 vendor invoices a month. Manual data entry was taking 2 hours a day — one person, copy-pasting vendor names, GSTIN numbers, line items, and totals into Tally. After building this agent, the same 300 invoices process in 15 minutes, with human review only for the ~8% the agent flags as low-confidence.
The extraction cost: ₹0.50–2 per invoice. Manual entry cost: ₹15–50 per invoice. The math is obvious.
This post builds the complete pipeline: text PDF extraction with pdfplumber, OCR for scanned documents, Claude-based structured extraction with confidence scoring, and overnight batch processing for large volumes.
The extraction stack
Three tools for three scenarios:
- pdfplumber: for text-based PDFs (computer-generated invoices, bank statements). Fast, free, and needs no API calls for extraction.
- pytesseract + pdf2image: for scanned PDFs (photographed receipts, photocopied invoices). This is the OCR path.
- Claude's vision API: a fallback for complex layouts, tables, or mixed content that text extraction gets wrong.
pip install pdfplumber pytesseract pdf2image anthropic pydantic
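# Also install Tesseract OCR: brew install tesseract (macOS) or apt install tesseract-ocr (Linux)
# pdf2image also needs Poppler: brew install poppler (macOS) or apt install poppler-utils (Linux)
# The OCR step uses lang="eng+hin", which needs the Hindi language data (e.g. apt install tesseract-ocr-hin)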
The extraction agent
Start with a Pydantic model that defines exactly what you want out of each invoice:
from pydantic import BaseModel, Field
from typing import Optional
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
import anthropic
import json
import os


class LineItem(BaseModel):
    description: str
    hsn_code: Optional[str] = None
    quantity: float
    rate: float
    amount: float


class InvoiceData(BaseModel):
    vendor_name: str
    vendor_gstin: Optional[str] = Field(None, description="15-character GSTIN")
    buyer_gstin: Optional[str] = None
    invoice_number: str
    invoice_date: str = Field(description="DD/MM/YYYY or YYYY-MM-DD")
    line_items: list[LineItem]
    taxable_amount: float
    cgst: float = 0.0
    sgst: float = 0.0
    igst: float = 0.0
    total: float
    currency: str = "INR"


class ExtractionResult(BaseModel):
    data: Optional[InvoiceData]
    confidence: int = Field(ge=1, le=5, description="Overall extraction confidence 1-5")
    low_confidence_fields: list[str] = Field(
        default_factory=list,
        description="Fields the model is uncertain about",
    )
    extraction_method: str
    raw_text_preview: str
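The schema only checks types, so a cheap arithmetic cross-check can catch misread amounts before they reach review. The sketch below is an addition, not part of the original pipeline: `check_totals` and its 1-rupee `tolerance` are hypothetical, it takes plain numbers so it runs without Pydantic, and it assumes line items sum to the taxable amount (which may not hold for invoices with discounts or freight).

```python
def check_totals(line_amounts, taxable_amount, cgst, sgst, igst, total, tolerance=1.0):
    """Return a list of inconsistencies; an empty list means the amounts agree."""
    problems = []
    # Assumption: no invoice-level discounts or charges outside the line items
    if abs(sum(line_amounts) - taxable_amount) > tolerance:
        problems.append("line items do not sum to taxable_amount")
    if abs(taxable_amount + cgst + sgst + igst - total) > tolerance:
        problems.append("taxable_amount plus taxes does not equal total")
    # An invoice is either intrastate (CGST + SGST) or interstate (IGST)
    if igst > 0 and (cgst > 0 or sgst > 0):
        problems.append("both IGST and CGST/SGST are non-zero")
    if abs(cgst - sgst) > tolerance:
        problems.append("CGST and SGST are not equal")
    return problems


if __name__ == "__main__":
    print(check_totals([1000.0, 500.0], 1500.0, 135.0, 135.0, 0.0, 1770.0))
```

Any non-empty result could be appended to `low_confidence_fields` so the invoice is routed to human review.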
Text PDF extraction
client = anthropic.Anthropic()


def extract_text_from_pdf(pdf_path: str) -> tuple[str, str]:
    """Returns (extracted_text, method_used)."""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages_text = []
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages_text.append(text)
            if pages_text:
                return "\n\n".join(pages_text), "pdfplumber"
    except Exception:
        pass
    # Fall back to OCR
    return ocr_pdf(pdf_path), "tesseract"


def ocr_pdf(pdf_path: str) -> str:
    """OCR a PDF using pytesseract."""
    images = convert_from_path(pdf_path, dpi=300)
    texts = []
    for image in images:
        text = pytesseract.image_to_string(image, lang="eng+hin")  # English + Hindi
        texts.append(text)
    return "\n\n".join(texts)
def extract_invoice(pdf_path: str) -> ExtractionResult:
    raw_text, method = extract_text_from_pdf(pdf_path)
    if not raw_text.strip():
        return ExtractionResult(
            data=None,
            confidence=1,
            low_confidence_fields=["all"],
            extraction_method="failed",
            raw_text_preview="",
        )
    schema = InvoiceData.model_json_schema()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""Extract invoice data from this text. Return a JSON object with three fields:
1. "data": the invoice fields matching this schema: {json.dumps(schema)}
2. "confidence": integer 1–5 (5 = completely confident in all fields)
3. "low_confidence_fields": list of field names you're uncertain about (empty list if confident in all)
Rules:
- GSTIN format: 15 characters, starts with 2-digit state code
- If a field is not present in the document, use null
- Amounts are in INR as floats
- If GST type is unclear (interstate vs intrastate), check if IGST or CGST+SGST is mentioned
- Return only the JSON object, no other text
Invoice text:
{raw_text[:6000]}"""
        }],
    )
    try:
        result_json = json.loads(response.content[0].text)
        invoice_data = InvoiceData.model_validate(result_json.get("data", {}))
        return ExtractionResult(
            data=invoice_data,
            confidence=result_json.get("confidence", 3),
            low_confidence_fields=result_json.get("low_confidence_fields", []),
            extraction_method=method,
            raw_text_preview=raw_text[:500],
        )
    except Exception:
        # Covers invalid JSON (json.JSONDecodeError) and Pydantic validation errors
        return ExtractionResult(
            data=None,
            confidence=1,
            low_confidence_fields=["all"],
            extraction_method=method,
            raw_text_preview=raw_text[:500],
        )
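Calling `json.loads` directly on the response text fails whenever the model wraps its answer in a Markdown code fence or adds a sentence around it, which can still happen despite the "Return only the JSON object" rule. A tolerant parser reduces those avoidable failures. This is a minimal sketch; `parse_model_json` is a hypothetical helper, not part of the Anthropic SDK.

```python
import json


def parse_model_json(text: str) -> dict:
    """Parse a JSON object from model output, tolerating fences and stray prose."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the text
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model output")
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


if __name__ == "__main__":
    print(parse_model_json('Here is the result: {"confidence": 4}'))
```

Every failure path raises `ValueError` (`json.JSONDecodeError` is a subclass), so a caller can catch one exception type and fall back to the low-confidence result.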
Vision API fallback for complex layouts
Some invoices have complex table layouts that pdfplumber misses. For these, pass the PDF page as an image directly to Claude's vision API:
import base64
import io
from pdf2image import convert_from_path


def extract_with_vision(pdf_path: str) -> ExtractionResult:
    """Use Claude's vision API for complex layout PDFs (reads the first page only)."""
    failed = ExtractionResult(
        data=None,
        confidence=1,
        low_confidence_fields=["all"],
        extraction_method="vision_failed",
        raw_text_preview="",
    )
    images = convert_from_path(pdf_path, dpi=200, first_page=1, last_page=1)
    if not images:
        return failed
    # Convert PIL image to base64
    buffer = io.BytesIO()
    images[0].save(buffer, format="PNG")
    image_data = base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
    schema = InvoiceData.model_json_schema()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    # Name the three keys explicitly: the parsing code below reads "data"
                    "text": f"""Extract invoice data from this image. Return a JSON object with three fields:
1. "data": the invoice fields matching this schema: {json.dumps(schema)}
2. "confidence": integer 1–5
3. "low_confidence_fields": list of field names you're uncertain about
Return only the JSON object.""",
                }
            ],
        }],
    )
    try:
        result_json = json.loads(response.content[0].text)
        invoice_data = InvoiceData.model_validate(result_json.get("data", {}))
        return ExtractionResult(
            data=invoice_data,
            confidence=result_json.get("confidence", 3),
            low_confidence_fields=result_json.get("low_confidence_fields", []),
            extraction_method="vision",
            raw_text_preview="(vision extraction: no raw text)",
        )
    except Exception:
        return failed


def smart_extract_invoice(pdf_path: str) -> ExtractionResult:
    """Try text extraction first; fall back to vision if confidence is low."""
    result = extract_invoice(pdf_path)
    if result.confidence < 3 or result.data is None:
        vision_result = extract_with_vision(pdf_path)
        if vision_result.confidence > result.confidence:
            return vision_result
    return result
Processing a batch of invoices
For monthly invoice runs (100–500 invoices), use the Anthropic Batch API for 50% off token costs:
import time


def batch_extract_invoices(invoice_paths: list[str]) -> list[dict]:
    """Process invoices overnight using the Batch API."""
    # First, extract text from all PDFs
    extracted_texts = {}
    for path in invoice_paths:
        text, method = extract_text_from_pdf(path)
        extracted_texts[path] = {"text": text, "method": method}
    schema = InvoiceData.model_json_schema()
    # Build batch requests. The API restricts custom_id length and characters,
    # so file paths are not safe IDs: use a short ID and map it back to the path.
    batch_requests = []
    id_to_path = {}
    for i, (path, data) in enumerate(extracted_texts.items()):
        if not data["text"].strip():
            continue
        custom_id = f"invoice-{i}"
        id_to_path[custom_id] = path
        # Name the three keys explicitly: the parsing code below reads "data"
        prompt = (
            "Extract invoice data. Return only a JSON object with three fields: "
            f'"data" (matching this schema: {json.dumps(schema)}), '
            '"confidence" (integer 1-5), and "low_confidence_fields" (list of field names).\n\n'
            f"{data['text'][:6000]}"
        )
        batch_requests.append({
            "custom_id": custom_id,
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1500,
                "messages": [{"role": "user", "content": prompt}],
            }
        })
    # Submit batch
    batch = client.messages.batches.create(requests=batch_requests)
    print(f"Batch submitted: {batch.id} - {len(batch_requests)} invoices")
    # Poll for results (typically 15–60 minutes; a batch can take up to 24 hours)
    while True:
        status = client.messages.batches.retrieve(batch.id)
        print(f"Status: {status.processing_status} - {status.request_counts}")
        if status.processing_status == "ended":
            break
        time.sleep(60)
    # Collect results
    results = []
    for result in client.messages.batches.results(batch.id):
        path = id_to_path.get(result.custom_id, result.custom_id)
        if result.result.type != "succeeded":
            # Errored, canceled, or expired requests: record them instead of dropping them
            results.append({"path": path, "error": result.result.type})
            continue
        try:
            parsed = json.loads(result.result.message.content[0].text)
            results.append({
                "path": path,
                "data": InvoiceData.model_validate(parsed.get("data", {})),
                "confidence": parsed.get("confidence", 3),
                "low_confidence_fields": parsed.get("low_confidence_fields", []),
            })
        except Exception as e:
            results.append({"path": path, "error": str(e)})
    return results
See the Batch API guide for polling patterns and error handling.
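As one possible polling pattern, the fixed 60-second sleep can be replaced with exponential backoff and an overall timeout, so a stuck batch does not block the job forever. The sketch below is an assumption rather than anything the SDK provides: `wait_until_ended` and its parameters are hypothetical, and it takes a status-fetching callable so it can be tested without network access.

```python
import time


def wait_until_ended(fetch_status, initial_delay=30.0, max_delay=300.0,
                     timeout=24 * 3600, sleep=time.sleep):
    """Poll fetch_status() until it returns "ended", backing off between polls.

    fetch_status: zero-argument callable returning the processing status string.
    Returns the number of polls made; raises TimeoutError after `timeout` seconds.
    """
    delay, waited, polls = initial_delay, 0.0, 0
    while True:
        polls += 1
        if fetch_status() == "ended":
            return polls
        if waited >= timeout:
            raise TimeoutError(f"batch still not ended after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the wait, up to max_delay
```

With the client from earlier, the call would look like `wait_until_ended(lambda: client.messages.batches.retrieve(batch.id).processing_status)`.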
India-specific extraction notes
GSTIN validation: A valid GSTIN is exactly 15 characters. The first two are the state code (07 = Delhi, 27 = Maharashtra, 29 = Karnataka, etc.). Add this to your post-extraction validation:
import re


def validate_gstin(gstin: str | None) -> bool:
    if not gstin:
        return True  # Optional field
    pattern = r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$"
    return bool(re.match(pattern, gstin))
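The regex only checks the shape of the string. The 15th character is a check character, commonly described as a mod-36, Luhn-style checksum over the first 14 characters, and verifying it catches single-character OCR misreads that the pattern lets through. The sketch below implements that commonly documented scheme; treat it as an assumption and confirm it against the GST portal's own validation before rejecting invoices on it. The names `gstin_check_char` and `has_valid_check_char` are hypothetical.

```python
GSTIN_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def gstin_check_char(first_14: str) -> str:
    """Compute the check character for the first 14 characters of a GSTIN."""
    total = 0
    for i, ch in enumerate(first_14):
        value = GSTIN_ALPHABET.index(ch)
        # Weights alternate 1, 2, 1, 2, ... from the left
        product = value * (1 if i % 2 == 0 else 2)
        # Add the quotient and remainder of the product in base 36
        total += product // 36 + product % 36
    return GSTIN_ALPHABET[(36 - total % 36) % 36]


def has_valid_check_char(gstin: str) -> bool:
    gstin = gstin.strip().upper()
    if len(gstin) != 15 or any(ch not in GSTIN_ALPHABET for ch in gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]


if __name__ == "__main__":
    print(has_valid_check_char("27AAPFU0939F1ZV"))
```

A failed check is a good reason to add `vendor_gstin` to `low_confidence_fields` rather than to discard the invoice outright.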
HSN codes: Most B2B GST invoices include 4–8 digit HSN codes per line item. Extract these — they're required for input tax credit claims.
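CGST/SGST vs IGST: Intrastate transactions use CGST + SGST (equal split). Interstate transactions use IGST (full rate). If the vendor and buyer GSTINs start with the same two digits, the transaction is usually intrastate. Strictly, the place of supply decides, so treat the GSTIN comparison as a heuristic.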
Form 16: If you're processing salary documents for HR, Form 16 has a different structure — employer TAN instead of GSTIN, salary components instead of line items. Build a separate schema for it.
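The CGST/SGST versus IGST rule above can double as a cross-check on the extracted tax fields: if the state codes say "interstate" but the model extracted CGST and SGST, something was probably misread. This is a minimal sketch under the same-state-code heuristic; `expected_tax_type` and `tax_type_mismatch` are hypothetical helpers, and the sample buyer GSTIN in the usage line is made up for illustration.

```python
def expected_tax_type(vendor_gstin, buyer_gstin):
    """Return "CGST+SGST", "IGST", or None if either state code is unavailable."""
    codes = []
    for gstin in (vendor_gstin, buyer_gstin):
        if not gstin or len(gstin) < 2 or not gstin[:2].isdigit():
            return None
        codes.append(gstin[:2])
    return "CGST+SGST" if codes[0] == codes[1] else "IGST"


def tax_type_mismatch(vendor_gstin, buyer_gstin, cgst, sgst, igst):
    """True when the extracted taxes contradict the state-code heuristic."""
    expected = expected_tax_type(vendor_gstin, buyer_gstin)
    if expected is None:
        return False  # Cannot tell, so do not flag
    if expected == "IGST":
        return cgst > 0 or sgst > 0
    return igst > 0


if __name__ == "__main__":
    # Maharashtra vendor (27), Karnataka buyer (29), but CGST/SGST extracted
    print(tax_type_mismatch("27AAPFU0939F1ZV", "29ABCDE1234F1Z5", 90.0, 90.0, 0.0))
```

A mismatch is better treated as a review flag than as an error, since the place of supply can differ from the buyer's registered state.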
Routing low-confidence extractions for human review
CONFIDENCE_THRESHOLD = 3  # Below this: flag for human review


def process_invoice_batch(invoice_paths: list[str]) -> dict:
    results = {"auto_processed": [], "needs_review": [], "failed": []}
    for path in invoice_paths:
        result = smart_extract_invoice(path)
        if result.data is None:
            results["failed"].append({"path": path, "reason": "extraction_failed"})
        elif result.confidence < CONFIDENCE_THRESHOLD or result.low_confidence_fields:
            results["needs_review"].append({
                "path": path,
                "data": result.data,
                "confidence": result.confidence,
                "uncertain_fields": result.low_confidence_fields,
            })
        else:
            results["auto_processed"].append({
                "path": path,
                "data": result.data,
            })
    # Count uncertain fields across all invoices that need review
    uncertain_count = sum(len(item["uncertain_fields"]) for item in results["needs_review"])
    print(f"Auto-processed: {len(results['auto_processed'])}")
    print(f"Needs review: {len(results['needs_review'])} ({uncertain_count} uncertain fields)")
    print(f"Failed: {len(results['failed'])}")
    return results
In production, the "needs review" items go to a simple web UI where a human can confirm or correct the extracted values before they're written to Tally/Zoho Books.
Cost breakdown
| Invoice type | Method | Cost per invoice |
|---|---|---|
| Standard digital PDF | pdfplumber + Sonnet | ~₹0.50 |
| Complex layout | pdfplumber + vision | ~₹1.50 |
| Scanned/photographed | OCR + Sonnet | ~₹1.50 |
| Batch processing (overnight) | Batch API | 50% of above |
For 300 invoices/month with mixed types, extraction costs roughly ₹300–450/month, versus ₹4,500–15,000 for manual entry at ₹15–50 each.
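The monthly figure depends on the mix of invoice types. As a worked example, the sketch below uses the per-invoice costs from the table with a hypothetical mix; the 150/75/75 split is an assumption for illustration, not the client's actual data, and `monthly_cost` is a made-up helper.

```python
# Approximate cost per invoice in INR, taken from the table above
COST_PER_INVOICE = {"digital": 0.50, "complex": 1.50, "scanned": 1.50}


def monthly_cost(mix, batch_discount=False):
    """Total monthly extraction cost in INR for a {invoice_type: count} mix."""
    total = sum(COST_PER_INVOICE[kind] * count for kind, count in mix.items())
    # The table lists batch processing at 50% of the standard cost
    return total * 0.5 if batch_discount else total


mix = {"digital": 150, "complex": 75, "scanned": 75}  # hypothetical 300-invoice month
automated = monthly_cost(mix)
manual_low, manual_high = 300 * 15, 300 * 50

print(f"Automated: {automated:.0f} INR, batch: {monthly_cost(mix, batch_discount=True):.0f} INR")
print(f"Manual: {manual_low}-{manual_high} INR")
```

A month dominated by scanned or complex invoices moves the total toward the ₹450 end of the range.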
The Pydantic AI post covers how to restructure this as a Pydantic AI agent if you want typed dependency injection and easier testing. The structured outputs post has more patterns for getting reliable JSON from LLM responses.



