Most teams ship prompts without systematic evaluation and wonder why outputs degrade silently in production. A prompt that works 90% of the time in testing will still fail on 10% of your real traffic — and you won't know which 10% until a user complains. Evals are what separates a prototype from a product.
A minimal eval harness takes 50 lines of Python and 2 hours to set up. Here's exactly how.
The three types of evals
Rule-based evals
Regex patterns, exact match, substring checks. Cheap, fast, deterministic. These run in milliseconds and cost nothing in API calls.
Use rule-based evals for:
- Format validation: is this valid JSON? Does the output parse?
- Constraint checking: did the response stay under 200 words?
- Critical content checks: does it mention the price? Does it contain required disclaimers?
- Structure checks: are all required sections present?
```python
import json
import re

def eval_json_validity(output: str) -> dict:
    try:
        json.loads(output)
        return {"pass": True, "score": 1.0}
    except json.JSONDecodeError as e:
        return {"pass": False, "score": 0.0, "error": str(e)}

def eval_word_count(output: str, max_words: int = 200) -> dict:
    words = len(output.split())
    return {
        "pass": words <= max_words,
        "score": min(1.0, max_words / max(words, 1)),
        "word_count": words
    }

def eval_contains_gstin(output: str) -> dict:
    # GSTIN pattern: 2 digits + 10-char PAN + 1 digit + Z + 1 check character
    pattern = r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}\d[Z]{1}[A-Z\d]{1}'
    found = bool(re.search(pattern, output))
    return {"pass": found, "score": 1.0 if found else 0.0}
```
Model-based evals (Claude-as-judge)
Use Claude to evaluate Claude's outputs on quality dimensions. This sounds circular but works well in practice — the judge model is evaluating against explicit criteria, which is different from generating an answer.
Use model-based evals for:
- Subjective quality: is this response actually helpful?
- Complex criteria: did it follow all 5 instructions from the system prompt?
- Comparing versions: is v2 of this prompt better than v1?
- Detecting subtle failures: did the model hallucinate a regulation that doesn't exist?
Human evals
Ground truth for calibrating the other two. Run periodically — not on every deployment. Human evals are expensive (time), so use them strategically: when launching a new feature, quarterly audits, and whenever your model-based judge scores drift unexpectedly.
The simplest implementation: a CSV with input, output, and a score column (1-5). Share it via Google Sheets with the person doing the rating. Aggregate weekly.
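A sketch of that export (the row shape, a list of dicts with `input` and `output` keys, and the column names are my assumptions; adjust them to however you store your outputs):

```python
import csv

def export_for_human_review(rows, filename):
    """Write input/output pairs to a CSV with blank columns for a human rater to fill in."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "output", "human_score (1-5)", "notes"])
        for r in rows:
            writer.writerow([r["input"], r["output"], "", ""])
```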
Building a minimal eval harness in Python (50 lines)
```python
import json
import csv
from anthropic import Anthropic
from typing import Callable, List, Dict, Any

client = Anthropic()

def run_eval(
    test_cases: List[Dict],
    prompt_fn: Callable[[Dict], str],
    eval_fn: Callable[[str, Dict], Dict],
    model: str = "claude-sonnet-4-6",
    effort: str = "medium"
) -> List[Dict]:
    results = []
    for case in test_cases:
        # Generate response
        response = client.messages.create(
            model=model,
            effort=effort,
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt_fn(case)}]
        )
        output = response.content[0].text

        # Evaluate
        scores = eval_fn(output, case)
        results.append({
            "input": case,
            "output": output,
            "scores": scores,
            "tokens": response.usage.input_tokens + response.usage.output_tokens
        })
    return results

def save_results(results: List[Dict], filename: str):
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "output", "score", "tokens"])
        writer.writeheader()
        for r in results:
            writer.writerow({
                "input": json.dumps(r["input"]),
                "output": r["output"],
                "score": r["scores"].get("overall", 0),
                "tokens": r["tokens"]
            })
```
Using it:
```python
# Define your test cases
test_cases = [
    {
        "query": "What is the GST rate on software services?",
        "expected_answer": "18%",
        "context": "Standard GST rates document"
    },
    # Add 20-50 cases to get statistically meaningful results
]

# Define your prompt function
def build_prompt(case: Dict) -> str:
    return f"Answer this GST query concisely: {case['query']}"

# A simple rule-based check: is the expected answer present in the output?
def eval_contains_answer(output: str, expected: str) -> Dict:
    found = expected.lower() in output.lower()
    return {"pass": found, "score": 1.0 if found else 0.0}

# Define your eval function (can combine rule-based and model-based)
def eval_response(output: str, case: Dict) -> Dict:
    rule_score = eval_contains_answer(output, case["expected_answer"])["score"]
    # judge_response (defined in the next section) returns 1-5 scores; normalise to 0-1
    quality_score = judge_response(output, case["query"])["overall"] / 5
    return {
        "rule_based": rule_score,
        "quality": quality_score,
        "overall": (rule_score + quality_score) / 2
    }

results = run_eval(test_cases, build_prompt, eval_response)
save_results(results, "eval_results_2026_04_14.csv")
```
This is the skeleton. The eval_fn is where all the interesting work happens.
Claude-as-judge: how to write the judge prompt
The judge prompt is the most important thing to get right. A bad judge will give you meaningless scores and false confidence.
```python
JUDGE_PROMPT = """You are evaluating an AI assistant's response to a customer support query.

Task context: {task_description}

Evaluation criteria:
1. Accuracy (1-5): Does the response correctly address the customer's issue?
2. Completeness (1-5): Are all aspects of the query addressed?
3. Tone (1-5): Is it professional and empathetic?
4. Actionability (1-5): Does the customer know exactly what to do next?

Customer query: {query}

AI response: {response}

Score each criterion 1-5. Be strict — a 5 means genuinely excellent, not just adequate.

Return JSON only, no other text:
{{"accuracy": N, "completeness": N, "tone": N, "actionability": N, "overall": N, "reasoning": "one sentence"}}"""

def judge_response(output: str, query: str, task_description: str = "") -> Dict:
    judge_result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        output_config={
            "format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "eval_scores",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "accuracy": {"type": "number"},
                            "completeness": {"type": "number"},
                            "tone": {"type": "number"},
                            "actionability": {"type": "number"},
                            "overall": {"type": "number"},
                            "reasoning": {"type": "string"}
                        },
                        "required": ["accuracy", "completeness", "tone", "actionability", "overall", "reasoning"]
                    }
                }
            }
        },
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                task_description=task_description,
                query=query,
                response=output
            )
        }]
    )
    return json.loads(judge_result.content[0].text)
```
Calibrating the judge: run 50 test cases through your judge. Then manually rate the same 50 cases. Compare. If your judge's scores correlate with your human scores (Spearman's r > 0.7), the judge is usable. If it's lower, your judge prompt needs work — usually the criteria are too vague or the 1-5 scale isn't well anchored.
A good calibration trick: include anchor examples in the judge prompt. "A score of 5 for accuracy means the response contains no factual errors and directly answers the question. A score of 1 means the response is factually wrong or completely misses the question."
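A sketch of the correlation check itself, assuming you have the judge's overall score and your own 1-5 rating for the same cases in two parallel lists (scipy's `spearmanr` does the work):

```python
from scipy.stats import spearmanr

# judge_scores[i] and human_scores[i] rate the same test case
judge_scores = [4, 5, 2, 3, 4, 1, 5]   # from Claude-as-judge
human_scores = [4, 4, 2, 3, 5, 2, 5]   # from your manual pass

corr, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman's r = {corr:.2f} (p = {p_value:.3f})")
if corr < 0.7:
    print("Judge disagrees with humans too often; tighten the criteria or add anchor examples.")
```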
Free tools for Indian developers
Langfuse open source — self-host on a ₹400-600/month VPS
Langfuse is the cleanest open-source LLM observability tool. It tracks every LLM call, stores inputs and outputs, lets you build eval datasets from production traces, and shows cost trends over time. Self-hosting on Hostinger or DigitalOcean India region runs around ₹400-600/month (roughly $5-7 USD).
```yaml
# docker-compose.yml
version: "3"
services:
  langfuse-server:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:password@db:5432/langfuse
      NEXTAUTH_SECRET: your-secret-here
      NEXTAUTH_URL: http://your-vps-ip:3000
      SALT: your-salt-here
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
```
Then in your Python code:
```python
from langfuse import Langfuse

langfuse = Langfuse(public_key="...", secret_key="...", host="http://your-vps-ip:3000")
```
The real value: once you have production traffic flowing through Langfuse, you can tag specific traces as eval examples, build datasets from real failures, and run your eval suite against those datasets. This closes the loop between production failures and eval coverage.
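To get your eval runs into Langfuse alongside production traces, you can log each result as a trace. A sketch using the v2-style Python SDK (method names have changed between major SDK releases, so treat this as an outline and check the docs for the version you installed):

```python
# Log each eval result as a Langfuse trace so it sits next to production traces.
# Assumes `results` is the list returned by run_eval above.
for r in results:
    langfuse.trace(
        name="gst-assistant-eval",
        input=r["input"],
        output=r["output"],
        metadata={"scores": r["scores"], "tokens": r["tokens"]},
        tags=["eval", "prompt-v2.3"],
    )
langfuse.flush()  # make sure everything is sent before the script exits
```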
Weights & Biases free tier — 100GB storage, free forever
W&B is overkill for basic eval logging but excellent once you're comparing across multiple prompt versions or model configurations. The free tier gives you 100GB storage and unlimited runs. Log your eval results:
```python
import wandb

wandb.init(project="gst-assistant-evals", config={
    "model": "claude-sonnet-4-6",
    "effort": "medium",
    "prompt_version": "v2.3"
})

results = run_eval(test_cases, build_prompt, eval_response)

for i, result in enumerate(results):
    wandb.log({
        "overall_score": result["scores"]["overall"],
        "tokens": result["tokens"],
        "step": i
    })

# Log aggregate metrics
scores = [r["scores"]["overall"] for r in results]
wandb.log({
    "mean_score": sum(scores) / len(scores),
    "pass_rate": sum(1 for s in scores if s >= 0.7) / len(scores),
    "total_tokens": sum(r["tokens"] for r in results)
})

wandb.finish()
```
Plain CSV + Google Sheets
Not glamorous, but works for teams of 1-3. Export eval results to CSV, import to Google Sheets, use a pivot table to compare versions. For most early-stage products, this is genuinely enough. Don't over-engineer until you have volume that justifies it.
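If you'd rather stay local, a few lines of pandas recreate that pivot table. A sketch assuming the CSVs come from `save_results` above and you tag each run with a `prompt_version` column (the file names are placeholders):

```python
import pandas as pd

# Load two eval runs and tag them with the prompt version they used
v1 = pd.read_csv("eval_results_v1.csv").assign(prompt_version="v1")
v2 = pd.read_csv("eval_results_v2.csv").assign(prompt_version="v2")
df = pd.concat([v1, v2])

# Mean score and pass rate (score >= 0.7) per prompt version
summary = df.groupby("prompt_version")["score"].agg(
    mean_score="mean",
    pass_rate=lambda s: (s >= 0.7).mean(),
)
print(summary)
```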
What to measure
| Metric | How to measure | Why it matters |
|---|---|---|
| Task completion rate | Rule-based: did it produce the required output format? | Baseline reliability |
| Hallucination rate | Model-based: does response contain claims not in context? | Trust |
| Instruction following | Rule-based + model-based: did it follow all N constraints? | Production reliability |
| Consistency | Run same prompt 10x, measure variance in scores | Reproducibility |
| Latency P50/P99 | Time from request to first token | UX |
| Cost per successful call | Total cost ÷ number of passing calls | Business viability |
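The consistency row is the easiest to automate: run the same case repeatedly and look at the spread of scores. A minimal sketch reusing `run_eval` from above (the 0.1 threshold is arbitrary; tune it for your task):

```python
import statistics

def eval_consistency(case: dict, n_runs: int = 10) -> dict:
    """Run one test case n times and measure how much the overall score moves."""
    runs = run_eval([case] * n_runs, build_prompt, eval_response)
    scores = [r["scores"]["overall"] for r in runs]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "stable": statistics.stdev(scores) < 0.1,  # arbitrary threshold
    }
```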
For the hallucination metric specifically, the judge prompt matters a lot. The best pattern: provide the source documents, the query, and the response. Ask the judge to identify any claims in the response that aren't supported by the source documents.
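A sketch of that judge prompt (the wording and output format are illustrative, not a canonical template):

```python
HALLUCINATION_JUDGE_PROMPT = """You are checking an AI response for unsupported claims.

Source documents:
{source_documents}

User query: {query}

AI response: {response}

List every factual claim in the response that is NOT supported by the source documents.
If every claim is supported, return an empty list.

Return JSON only:
{{"unsupported_claims": ["..."], "hallucination": true/false}}"""
```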
India context: evaluating Hindi/Hinglish outputs
If your product serves Hindi speakers or uses Hinglish, add a language quality dimension to your judge prompt. The criteria that matter:
Natural code-mixing: does the response mix languages the way your target users actually speak, or does it feel like translated English? Real Hinglish isn't just inserting Hindi words — it has specific grammatical patterns.
Formality calibration: formal Hindi (आप) vs casual Hindi (तुम/तू) vs Hinglish varies by context. A banking chatbot should use formal Hindi. A consumer app might use casual Hinglish.
Regional markers: Mumbai Hindi sounds different from Delhi Hindi sounds different from Hyderabadi Hindi. If you're targeting a specific city, your judge should reflect that.
A practical Hindi eval criterion to add to your judge prompt: "Rate the Hindi/Hinglish naturalness 1-5. A 5 means a native Hindi speaker would read this as natural. A 3 means it's comprehensible but slightly stilted. A 1 means it reads as direct translation."
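One way to wire that in is to append a fifth criterion to the judge prompt and add a matching field to the judge's JSON schema. A sketch (the criterion wording and the `language_naturalness` field name are my choices, not a fixed convention):

```python
# A fifth criterion appended to JUDGE_PROMPT. Remember to add a matching
# "language_naturalness" field to the judge's JSON schema and to the JSON
# template at the end of the prompt so the score comes back as structured data.
LANGUAGE_CRITERION = """5. Language naturalness (1-5): Rate the Hindi/Hinglish naturalness.
A 5 means a native Hindi speaker would read this as natural.
A 3 means it is comprehensible but slightly stilted.
A 1 means it reads as a direct translation from English."""

HINDI_JUDGE_PROMPT = JUDGE_PROMPT.replace(
    "Customer query:", LANGUAGE_CRITERION + "\n\nCustomer query:"
)
```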
How to use eval results to choose between effort levels
This is one of the most practically useful things evals unlock. Run your eval suite at effort=low, effort=medium, and effort=high. Plot task completion rate and mean quality score against cost per call.
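A sketch of that sweep, reusing `run_eval` (the per-token cost is a placeholder; plug in your actual pricing):

```python
# Compare quality and cost across effort levels on the same test set
COST_PER_TOKEN = 0.000003  # placeholder blended rate; use your real pricing

for effort in ["low", "medium", "high"]:
    results = run_eval(test_cases, build_prompt, eval_response, effort=effort)
    scores = [r["scores"]["overall"] for r in results]
    tokens = sum(r["tokens"] for r in results)
    print(
        f"effort={effort}: "
        f"mean score {sum(scores) / len(scores):.2f}, "
        f"pass rate {sum(s >= 0.7 for s in scores) / len(scores):.0%}, "
        f"approx cost ${tokens * COST_PER_TOKEN:.4f}"
    )
```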
In my experience across customer support, document extraction, and content generation tasks:
- `effort=low` matches `effort=medium` for simple, well-defined tasks (format conversion, extraction from clean documents)
- `effort=medium` is the elbow point for most tasks: meaningfully better than low, not significantly worse than high
- `effort=high` only justifies its cost for complex reasoning tasks where errors have real business consequences
Set your production effort level at the elbow. Run evals quarterly to check if it's shifted — as you improve your prompts, tasks that needed effort=medium may work fine at effort=low.
💡 Want to go deeper? The Advanced track covers evaluation frameworks as part of the prompt engineering curriculum, including how to build automated regression tests for prompts.
Next steps
- Claude 4.6 effort parameter and cost optimization — detailed cost/quality tradeoffs across effort levels
- Prompt caching and API cost reduction — reduce eval costs by 80% with prefix caching
- Evaluation frameworks lesson — the theory behind what to measure and why
- Claude Opus 4.6 prompting guide — when to use Opus as your judge model



