Most teams ship prompts without systematic evaluation and wonder why outputs degrade silently in production. A prompt that works 90% of the time in testing will still fail on 10% of your real traffic — and you won't know which 10% until a user complains. Evals are what separates a prototype from a product.
A minimal eval harness takes 50 lines of Python and 2 hours to set up. Here's exactly how.
The three types of evals
Rule-based evals
Regex patterns, exact match, substring checks. Cheap, fast, deterministic. These run in milliseconds and cost nothing in API calls.
Use rule-based evals for:
- Format validation: is this valid JSON? Does the output parse?
- Constraint checking: did the response stay under 200 words?
- Critical content checks: does it mention the price? Does it contain required disclaimers?
- Structure checks: are all required sections present?
```python
import json
import re

def eval_json_validity(output: str) -> dict:
    try:
        json.loads(output)
        return {"pass": True, "score": 1.0}
    except json.JSONDecodeError as e:
        return {"pass": False, "score": 0.0, "error": str(e)}

def eval_word_count(output: str, max_words: int = 200) -> dict:
    words = len(output.split())
    return {
        "pass": words <= max_words,
        "score": min(1.0, max_words / max(words, 1)),
        "word_count": words
    }

def eval_contains_gstin(output: str) -> dict:
    # GSTIN pattern: 2 digits + 10-char PAN + 1 digit + Z + 1 check character
    pattern = r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}\d[Z]{1}[A-Z\d]{1}'
    found = bool(re.search(pattern, output))
    return {"pass": found, "score": 1.0 if found else 0.0}
```
Model-based evals (Claude-as-judge)
Use Claude to evaluate Claude's outputs on quality dimensions. This sounds circular but works well in practice — the judge model is evaluating against explicit criteria, which is different from generating an answer.
Use model-based evals for:
- Subjective quality: is this response actually helpful?
- Complex criteria: did it follow all 5 instructions from the system prompt?
- Comparing versions: is v2 of this prompt better than v1?
- Detecting subtle failures: did the model hallucinate a regulation that doesn't exist?
Human evals
Ground truth for calibrating the other two. Run periodically — not on every deployment. Human evals are expensive (time), so use them strategically: when launching a new feature, quarterly audits, and whenever your model-based judge scores drift unexpectedly.
The simplest implementation: a CSV with input, output, and a score column (1-5). Share it via Google Sheets with the person doing the rating. Aggregate weekly.
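A sketch of that export (the row shape, a list of dicts with `input` and `output` keys, and the column names are my assumptions; adjust them to however you store your outputs):

```python
import csv

def export_for_human_review(rows, filename):
    """Write input/output pairs to a CSV with blank columns for a human rater to fill in."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "output", "human_score (1-5)", "notes"])
        for r in rows:
            writer.writerow([r["input"], r["output"], "", ""])
```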
Building a minimal eval harness in Python (50 lines)
```python
import json
import csv
from anthropic import Anthropic
from typing import Callable, List, Dict, Any

client = Anthropic()

def run_eval(
    test_cases: List[Dict],
    prompt_fn: Callable[[Dict], str],
    eval_fn: Callable[[str, Dict], Dict],
    model: str = "claude-sonnet-4-6",
    effort: str = "medium"
) -> List[Dict]:
    results = []
    for case in test_cases:
        # Generate response
        response = client.messages.create(
            model=model,
            effort=effort,
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt_fn(case)}]
        )
        output = response.content[0].text

        # Evaluate
        scores = eval_fn(output, case)
        results.append({
            "input": case,
            "output": output,
            "scores": scores,
            "tokens": response.usage.input_tokens + response.usage.output_tokens
        })
    return results

def save_results(results: List[Dict], filename: str):
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "output", "score", "tokens"])
        writer.writeheader()
        for r in results:
            writer.writerow({
                "input": json.dumps(r["input"]),
                "output": r["output"],
                "score": r["scores"].get("overall", 0),
                "tokens": r["tokens"]
            })
```
Using it:
```python
# Define your test cases
test_cases = [
    {
        "query": "What is the GST rate on software services?",
        "expected_answer": "18%",
        "context": "Standard GST rates document"
    },
    # Add 20-50 cases to get statistically meaningful results
]

# Define your prompt function
def build_prompt(case: Dict) -> str:
    return f"Answer this GST query concisely: {case['query']}"

# A simple rule-based check: is the expected answer present in the output?
def eval_contains_answer(output: str, expected: str) -> Dict:
    found = expected.lower() in output.lower()
    return {"pass": found, "score": 1.0 if found else 0.0}

# Define your eval function (can combine rule-based and model-based)
def eval_response(output: str, case: Dict) -> Dict:
    rule_score = eval_contains_answer(output, case["expected_answer"])["score"]
    # judge_response (defined in the next section) returns 1-5 scores; normalise to 0-1
    quality_score = judge_response(output, case["query"])["overall"] / 5
    return {
        "rule_based": rule_score,
        "quality": quality_score,
        "overall": (rule_score + quality_score) / 2
    }

results = run_eval(test_cases, build_prompt, eval_response)
save_results(results, "eval_results_2026_04_14.csv")
```
This is the skeleton. The eval_fn is where all the interesting work happens.
Claude-as-judge: how to write the judge prompt
The judge prompt is the most important thing to get right. A bad judge will give you meaningless scores and false confidence.
```python
JUDGE_PROMPT = """You are evaluating an AI assistant's response to a customer support query.

Task context: {task_description}

Evaluation criteria:
1. Accuracy (1-5): Does the response correctly address the customer's issue?
2. Completeness (1-5): Are all aspects of the query addressed?
3. Tone (1-5): Is it professional and empathetic?
4. Actionability (1-5): Does the customer know exactly what to do next?

Customer query: {query}

AI response: {response}

Score each criterion 1-5. Be strict — a 5 means genuinely excellent, not just adequate.

Return JSON only, no other text:
{{"accuracy": N, "completeness": N, "tone": N, "actionability": N, "overall": N, "reasoning": "one sentence"}}"""

def judge_response(output: str, query: str, task_description: str = "") -> Dict:
    judge_result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        output_config={
            "format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "eval_scores",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "accuracy": {"type": "number"},
                            "completeness": {"type": "number"},
                            "tone": {"type": "number"},
                            "actionability": {"type": "number"},
                            "overall": {"type": "number"},
                            "reasoning": {"type": "string"}
                        },
                        "required": ["accuracy", "completeness", "tone", "actionability", "overall", "reasoning"]
                    }
                }
            }
        },
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                task_description=task_description,
                query=query,
                response=output
            )
        }]
    )
    return json.loads(judge_result.content[0].text)
```
Calibrating the judge: run 50 test cases through your judge. Then manually rate the same 50 cases. Compare. If your judge's scores correlate with your human scores (Spearman's r > 0.7), the judge is usable. If it's lower, your judge prompt needs work — usually the criteria are too vague or the 1-5 scale isn't well anchored.
A good calibration trick: include anchor examples in the judge prompt. "A score of 5 for accuracy means the response contains no factual errors and directly answers the question. A score of 1 means the response is factually wrong or completely misses the question."
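A sketch of the correlation check itself, assuming you have the judge's overall score and your own 1-5 rating for the same cases in two parallel lists (scipy's `spearmanr` does the work):

```python
from scipy.stats import spearmanr

# judge_scores[i] and human_scores[i] rate the same test case
judge_scores = [4, 5, 2, 3, 4, 1, 5]   # from Claude-as-judge
human_scores = [4, 4, 2, 3, 5, 2, 5]   # from your manual pass

corr, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman's r = {corr:.2f} (p = {p_value:.3f})")
if corr < 0.7:
    print("Judge disagrees with humans too often; tighten the criteria or add anchor examples.")
```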
Free tools for Indian developers
Langfuse open source — self-host on a ₹400-600/month VPS
Langfuse is the cleanest open-source LLM observability tool. It tracks every LLM call, stores inputs and outputs, lets you build eval datasets from production traces, and shows cost trends over time. Self-hosting on Hostinger or DigitalOcean India region runs around ₹400-600/month (roughly $5-7 USD).
```yaml
# docker-compose.yml
version: "3"
services:
  langfuse-server:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:password@db:5432/langfuse
      NEXTAUTH_SECRET: your-secret-here
      NEXTAUTH_URL: http://your-vps-ip:3000
      SALT: your-salt-here
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
```
Then in your Python code:
```python
from langfuse import Langfuse

langfuse = Langfuse(public_key="...", secret_key="...", host="http://your-vps-ip:3000")
```
The real value: once you have production traffic flowing through Langfuse, you can tag specific traces as eval examples, build datasets from real failures, and run your eval suite against those datasets. This closes the loop between production failures and eval coverage.
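To get your eval runs into Langfuse alongside production traces, you can log each result as a trace. A sketch using the v2-style Python SDK (method names have changed between major SDK releases, so treat this as an outline and check the docs for the version you installed):

```python
# Log each eval result as a Langfuse trace so it sits next to production traces.
# Assumes `results` is the list returned by run_eval above.
for r in results:
    langfuse.trace(
        name="gst-assistant-eval",
        input=r["input"],
        output=r["output"],
        metadata={"scores": r["scores"], "tokens": r["tokens"]},
        tags=["eval", "prompt-v2.3"],
    )
langfuse.flush()  # make sure everything is sent before the script exits
```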
Weights & Biases free tier — 100GB storage, free forever
W&B is overkill for basic eval logging but excellent once you're comparing across multiple prompt versions or model configurations. The free tier gives you 100GB storage and unlimited runs. Log your eval results:
```python
import wandb

wandb.init(project="gst-assistant-evals", config={
    "model": "claude-sonnet-4-6",
    "effort": "medium",
    "prompt_version": "v2.3"
})

results = run_eval(test_cases, build_prompt, eval_response)

for i, result in enumerate(results):
    wandb.log({
        "overall_score": result["scores"]["overall"],
        "tokens": result["tokens"],
        "step": i
    })

# Log aggregate metrics
scores = [r["scores"]["overall"] for r in results]
wandb.log({
    "mean_score": sum(scores) / len(scores),
    "pass_rate": sum(1 for s in scores if s >= 0.7) / len(scores),
    "total_tokens": sum(r["tokens"] for r in results)
})

wandb.finish()
```
Plain CSV + Google Sheets
Not glamorous, but works for teams of 1-3. Export eval results to CSV, import to Google Sheets, use a pivot table to compare versions. For most early-stage products, this is genuinely enough. Don't over-engineer until you have volume that justifies it.
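If you'd rather stay local, a few lines of pandas recreate that pivot table. A sketch assuming the CSVs come from `save_results` above and you tag each run with a `prompt_version` column (the file names are placeholders):

```python
import pandas as pd

# Load two eval runs and tag them with the prompt version they used
v1 = pd.read_csv("eval_results_v1.csv").assign(prompt_version="v1")
v2 = pd.read_csv("eval_results_v2.csv").assign(prompt_version="v2")
df = pd.concat([v1, v2])

# Mean score and pass rate (score >= 0.7) per prompt version
summary = df.groupby("prompt_version")["score"].agg(
    mean_score="mean",
    pass_rate=lambda s: (s >= 0.7).mean(),
)
print(summary)
```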
What to measure
| Metric | How to measure | Why it matters |
|---|---|---|
| Task completion rate | Rule-based: did it produce the required output format? | Baseline reliability |
| Hallucination rate | Model-based: does response contain claims not in context? | Trust |
| Instruction following | Rule-based + model-based: did it follow all N constraints? | Production reliability |
| Consistency | Run same prompt 10x, measure variance in scores | Reproducibility |
| Latency P50/P99 | Time from request to first token | UX |
| Cost per successful call | Total cost ÷ number of passing calls | Business viability |
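The consistency row is the easiest to automate: run the same case repeatedly and look at the spread of scores. A minimal sketch reusing `run_eval` from above (the 0.1 threshold is arbitrary; tune it for your task):

```python
import statistics

def eval_consistency(case: dict, n_runs: int = 10) -> dict:
    """Run one test case n times and measure how much the overall score moves."""
    runs = run_eval([case] * n_runs, build_prompt, eval_response)
    scores = [r["scores"]["overall"] for r in runs]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "stable": statistics.stdev(scores) < 0.1,  # arbitrary threshold
    }
```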
For the hallucination metric specifically, the judge prompt matters a lot. The best pattern: provide the source documents, the query, and the response. Ask the judge to identify any claims in the response that aren't supported by the source documents.
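A sketch of that judge prompt (the wording and output format are illustrative, not a canonical template):

```python
HALLUCINATION_JUDGE_PROMPT = """You are checking an AI response for unsupported claims.

Source documents:
{source_documents}

User query: {query}

AI response: {response}

List every factual claim in the response that is NOT supported by the source documents.
If every claim is supported, return an empty list.

Return JSON only:
{{"unsupported_claims": ["..."], "hallucination": true/false}}"""
```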
India context: evaluating Hindi/Hinglish outputs
If your product serves Hindi speakers or uses Hinglish, add a language quality dimension to your judge prompt. The criteria that matter:
Natural code-mixing: does the response mix languages the way your target users actually speak, or does it feel like translated English? Real Hinglish isn't just inserting Hindi words — it has specific grammatical patterns.
Formality calibration: formal Hindi (आप) vs casual Hindi (तुम/तू) vs Hinglish varies by context. A banking chatbot should use formal Hindi. A consumer app might use casual Hinglish.
Regional markers: Mumbai Hindi sounds different from Delhi Hindi sounds different from Hyderabadi Hindi. If you're targeting a specific city, your judge should reflect that.
A practical Hindi eval criterion to add to your judge prompt: "Rate the Hindi/Hinglish naturalness 1-5. A 5 means a native Hindi speaker would read this as natural. A 3 means it's comprehensible but slightly stilted. A 1 means it reads as direct translation."
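One way to wire that in is to append a fifth criterion to the judge prompt and add a matching field to the judge's JSON schema. A sketch (the criterion wording and the `language_naturalness` field name are my choices, not a fixed convention):

```python
# A fifth criterion appended to JUDGE_PROMPT. Remember to add a matching
# "language_naturalness" field to the judge's JSON schema and to the JSON
# template at the end of the prompt so the score comes back as structured data.
LANGUAGE_CRITERION = """5. Language naturalness (1-5): Rate the Hindi/Hinglish naturalness.
A 5 means a native Hindi speaker would read this as natural.
A 3 means it is comprehensible but slightly stilted.
A 1 means it reads as a direct translation from English."""

HINDI_JUDGE_PROMPT = JUDGE_PROMPT.replace(
    "Customer query:", LANGUAGE_CRITERION + "\n\nCustomer query:"
)
```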
How to use eval results to choose between effort levels
This is one of the most practically useful things evals unlock. Run your eval suite at effort=low, effort=medium, and effort=high. Plot task completion rate and mean quality score against cost per call.
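A sketch of that sweep, reusing `run_eval` (the per-token cost is a placeholder; plug in your actual pricing):

```python
# Compare quality and cost across effort levels on the same test set
COST_PER_TOKEN = 0.000003  # placeholder blended rate; use your real pricing

for effort in ["low", "medium", "high"]:
    results = run_eval(test_cases, build_prompt, eval_response, effort=effort)
    scores = [r["scores"]["overall"] for r in results]
    tokens = sum(r["tokens"] for r in results)
    print(
        f"effort={effort}: "
        f"mean score {sum(scores) / len(scores):.2f}, "
        f"pass rate {sum(s >= 0.7 for s in scores) / len(scores):.0%}, "
        f"approx cost ${tokens * COST_PER_TOKEN:.4f}"
    )
```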
In my experience across customer support, document extraction, and content generation tasks:
- `effort=low` matches `effort=medium` for simple, well-defined tasks (format conversion, extraction from clean documents)
- `effort=medium` is the elbow point for most tasks: meaningfully better than low, not significantly worse than high
- `effort=high` only justifies its cost for complex reasoning tasks where errors have real business consequences
Set your production effort level at the elbow. Run evals quarterly to check if it's shifted — as you improve your prompts, tasks that needed effort=medium may work fine at effort=low.
💡 Want to go deeper? The Advanced track covers evaluation frameworks as part of the prompt engineering curriculum, including how to build automated regression tests for prompts.
Next steps
- Claude 4.6 effort parameter and cost optimization — detailed cost/quality tradeoffs across effort levels
- Prompt caching and API cost reduction — reduce eval costs by 80% with prefix caching
- Evaluation frameworks lesson — the theory behind what to measure and why
- Claude Opus 4.6 prompting guide — when to use Opus as your judge model



