Shipping a prompt to production without testing it is like shipping code without running it. Most prompts get written, tried a few times in the playground, and called done. That works until it doesn't — and when it doesn't, you have no way to tell whether a model update, a small wording change, or edge-case input caused the regression.
This lesson covers how to test prompts like code: with golden sets, regression suites, automated evaluation, and metrics that actually tell you if a prompt is getting better or worse.
Why prompt testing is different from code testing
Code tests have deterministic outputs — the same input always produces the same output, and a failing test is a clear signal. Prompt outputs are probabilistic. The same prompt can produce slightly different responses across runs, and "correct" is often subjective.
This changes what testing means. You're not checking for exact string matches — you're checking for:
- Correctness: does the output contain the right information?
- Format compliance: does the output follow the required structure?
- Tone and safety: does it stay within boundaries?
- Consistency: does quality hold across a range of inputs?
The goal is a test suite that catches regressions — cases where a prompt change or model update makes things measurably worse — without requiring you to manually read hundreds of outputs.
Building a golden set
A golden set is a collection of (input, expected output) pairs that represent the range of real-world inputs your prompt will encounter.
Building a useful golden set:
- Collect real inputs: If your prompt is in production, sample actual user inputs. If it's pre-launch, generate representative synthetic examples.
- Cover the distribution: Include typical cases (what 80% of inputs look like), edge cases (empty input, very short, very long, unusual formatting), and adversarial cases (inputs that try to break the expected output format).
- Write expected outputs carefully: For factual tasks, write exact expected answers. For generative tasks, write criteria instead of exact answers: "should contain X", "should not contain Y", "should be under 200 words".
- Size the set: 20-50 examples is usually enough to catch regressions. More is better for critical production prompts, but 20 beats zero.
```python
golden_set = [
    {
        "input": "Classify this support ticket: 'I can't log in to my account'",
        "expected_category": "Account Access",
        "must_not_contain": ["billing", "refund"],
        "max_length": 50
    },
    {
        "input": "Classify this support ticket: ''",  # edge case: empty input
        "expected_behavior": "should ask for more information or return 'Unknown'",
        "must_contain": ["unclear", "more information", "unknown"]
    },
    # ... more examples
]
```
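Before spending API calls, it can be worth sanity-checking the golden set itself. A minimal sketch (`validate_golden_set` is a hypothetical helper, not part of any library; the criterion key names mirror the example above):

```python
def validate_golden_set(golden_set: list) -> list:
    """Return descriptions of malformed entries (empty list means all OK)."""
    criterion_keys = {"expected_category", "must_contain",
                      "must_not_contain", "max_length", "expected_behavior"}
    problems = []
    for i, example in enumerate(golden_set):
        if "input" not in example:
            problems.append(f"example {i}: missing 'input'")
        if not criterion_keys & set(example):
            # A typo'd key would otherwise make the example auto-pass
            problems.append(f"example {i}: no recognized criteria")
    return problems
```

Running this once at load time means a misspelled key fails loudly instead of silently turning an example into a guaranteed pass.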
Running evaluations automatically
The simplest evaluation approach: run each golden set input through your prompt, then check the output against your criteria.
```python
import anthropic

client = anthropic.Anthropic()

def evaluate_prompt(system_prompt: str, golden_set: list) -> dict:
    results = []
    for example in golden_set:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=200,
            system=system_prompt,
            messages=[{"role": "user", "content": example["input"]}]
        )
        output = response.content[0].text

        # Check each criterion the example defines
        checks = {}
        if "expected_category" in example:
            checks["category_correct"] = example["expected_category"].lower() in output.lower()
        if "must_contain" in example:
            checks["contains_required"] = all(
                term.lower() in output.lower()
                for term in example["must_contain"]
            )
        if "must_not_contain" in example:
            checks["no_forbidden"] = not any(
                term.lower() in output.lower()
                for term in example["must_not_contain"]
            )
        if "max_length" in example:
            checks["within_length"] = len(output) <= example["max_length"]

        results.append({
            "input": example["input"],
            "output": output,
            "checks": checks,
            "passed": all(checks.values())
        })

    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / total,
        "passed": passed,
        "total": total,
        "failures": [r for r in results if not r["passed"]],
        "all_results": results  # per-example detail, needed for side-by-side comparison
    }
```
Run this every time you change your prompt. If your pass rate drops, the change is a regression.
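Because outputs are probabilistic, a single run can pass or fail by luck. One hedge is to run each example several times and score the fraction of passing runs. A sketch with injected callables so the logic stays testable without API calls (`generate` and `check` are assumptions of this sketch; in practice `generate` would wrap `client.messages.create`):

```python
def stable_pass_rate(generate, check, example: dict, runs: int = 3) -> float:
    """Fraction of repeated runs that pass the example's checks.

    generate(input_text) -> output text; check(output, example) -> bool.
    Both are injected so this sketch has no API dependency.
    """
    passes = sum(1 for _ in range(runs) if check(generate(example["input"]), example))
    return passes / runs
```

An example that only passes 2 out of 3 runs is itself useful information: it flags a flaky prompt behavior that a single-run suite would report inconsistently.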
LLM-as-judge evaluation
For tasks where output quality can't be checked with string matching — writing quality, reasoning depth, tone adherence — you can use an LLM to evaluate the output:
```python
import json

def judge_output(prompt_output: str, criteria: str) -> dict:
    """Use Claude to evaluate whether an output meets specified criteria."""
    judge_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        system="You are an evaluator assessing AI output quality. Be objective and strict.",
        messages=[{
            "role": "user",
            "content": f"""Evaluate this output against the criteria below.

Output to evaluate:
{prompt_output}

Evaluation criteria:
{criteria}

Respond with JSON: {{"score": 1-5, "reasoning": "brief explanation", "pass": true/false}}
Use pass=true only if score >= 4."""
        }]
    )
    return json.loads(judge_response.content[0].text)
```
LLM-as-judge isn't perfect — the judge model has its own biases — but it's far more scalable than manual review for large test suites. Use a different model as judge than the model being evaluated to reduce self-serving bias.
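One practical wrinkle: calling `json.loads` on the raw judge response fails whenever the judge wraps its JSON in prose or code fences. A best-effort extraction helper (a sketch; the regex fallback is a heuristic, not a full JSON parser):

```python
import json
import re

def parse_judge_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from a judge response.

    Try a direct parse first; if the model wrapped its answer in prose
    or fences, fall back to the outermost {...} span in the text.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```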
A/B testing prompt variants
When you're choosing between two prompt approaches, run them both against your golden set and compare:
```python
def compare_prompts(prompt_a: str, prompt_b: str, golden_set: list) -> None:
    results_a = evaluate_prompt(prompt_a, golden_set)
    results_b = evaluate_prompt(prompt_b, golden_set)

    print(f"Prompt A pass rate: {results_a['pass_rate']:.1%}")
    print(f"Prompt B pass rate: {results_b['pass_rate']:.1%}")

    # Show cases where they disagree
    for i, (r_a, r_b) in enumerate(zip(results_a.get("all_results", []),
                                       results_b.get("all_results", []))):
        if r_a["passed"] != r_b["passed"]:
            print(f"\nDisagreement on example {i}:")
            print(f"  Input: {r_a['input'][:100]}")
            print(f"  A passed: {r_a['passed']}, B passed: {r_b['passed']}")
```
This reveals exactly which cases each prompt handles better or worse — not just an overall win/loss.
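On a 20-50 example golden set, a few percentage points of pass-rate difference can be pure noise. A quick sanity check is a sign test on the discordant pairs, i.e. the examples where exactly one prompt passed (a sketch, assuming you count those cases from the disagreement output):

```python
from math import comb

def sign_test_p(a_only_wins: int, b_only_wins: int) -> float:
    """Two-sided sign-test p-value on discordant pairs.

    a_only_wins / b_only_wins count examples where exactly one prompt
    passed. A small p suggests the gap is real; on a 20-example set,
    a 2-3 case gap usually isn't.
    """
    n = a_only_wins + b_only_wins
    if n == 0:
        return 1.0
    k = min(a_only_wins, b_only_wins)
    # Two-sided tail probability under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(2 * tail, 1.0)
```

The sign test is deliberately crude: it needs no distributional assumptions and works on exactly the paired pass/fail data the comparison already produces.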
Regression testing in CI
Once you have a golden set and evaluation script, plug it into your CI pipeline. If your prompts are stored in the codebase, run the evaluation every time one changes:
```yaml
# .github/workflows/prompt-eval.yml
name: Prompt evaluation

on:
  push:
    paths:
      - 'prompts/**'
      - 'src/prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evaluation and check pass rate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # The script prints the pass rate as an integer percentage
          PASS_RATE=$(python scripts/evaluate_prompts.py)
          if [ "$PASS_RATE" -lt "90" ]; then
            echo "Prompt evaluation failed: pass rate below 90%"
            exit 1
          fi
```
This makes prompt regressions as visible as code regressions — they show up as failed builds.
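One way to structure the gating script is sketched below. `run_eval` is injected here (an assumption of this sketch) so the gating logic is testable without API calls; the real script would wire it to `evaluate_prompt` with the golden set, and print the integer pass rate the workflow compares against 90.

```python
def pass_rate_percent(results: dict) -> int:
    """Convert an evaluation result dict into an integer percentage."""
    return int(round(results["pass_rate"] * 100))

def main(run_eval) -> int:
    """Print the pass rate and return a shell exit code for CI.

    In the real scripts/evaluate_prompts.py, wrap this in an
    if __name__ == "__main__" block and pass the real evaluation call.
    """
    rate = pass_rate_percent(run_eval())
    print(rate)
    return 0 if rate >= 90 else 1
```

Returning a nonzero exit code on failure gives you a second line of defense: even if the shell-side threshold check is removed, the build still fails.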
What to measure
The metrics that matter most depend on your use case:
- For classification tasks: accuracy, precision, recall per class, confusion matrix
- For extraction tasks: field-level accuracy (what % of fields were extracted correctly), false positive rate (fields extracted that shouldn't be)
- For generation tasks: format compliance rate, criteria pass rate, latency, token count
- For agents: task completion rate, tool call accuracy, error recovery rate, total cost per task
Pick 2-3 metrics that directly reflect whether your prompt is doing its job. Track them over time. A chart of your pass rate across prompt versions tells you immediately whether your iterations are improvements or regressions.
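For the classification case, the per-class numbers fall out of (expected, predicted) pairs directly. A minimal sketch:

```python
from collections import Counter

def per_class_metrics(pairs: list) -> dict:
    """Precision and recall per class from (expected, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fp[predicted] += 1  # wrongly claimed this class
            fn[expected] += 1   # missed this class
    metrics = {}
    for cls in set(tp) | set(fp) | set(fn):
        p_denom = tp[cls] + fp[cls]
        r_denom = tp[cls] + fn[cls]
        metrics[cls] = {
            "precision": tp[cls] / p_denom if p_denom else 0.0,
            "recall": tp[cls] / r_denom if r_denom else 0.0,
        }
    return metrics
```

Per-class numbers matter because an overall accuracy figure can hide a prompt that has quietly stopped predicting a rare category at all.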
The manual review you can't skip
Automated evaluation catches many things but misses others — especially subtle quality regressions and failure modes you haven't anticipated. Build in regular manual review:
- Every time you push a new prompt version, read 20 random outputs side-by-side with the previous version
- When pass rates drop, read the failing cases — they reveal the actual failure mode, not just the metric
- Monthly: sample 50 production outputs and check for quality drift even if your automated metrics look healthy
Automated evaluation scales. Manual review maintains quality. You need both.
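For the side-by-side reading, sampling the same pairs each time helps two reviewers compare notes. A small sketch (the fixed seed is a convenience assumption for reproducibility):

```python
import random

def sample_for_review(old_outputs: list, new_outputs: list,
                      n: int = 20, seed: int = 0) -> list:
    """Pick up to n matched (old, new) output pairs for manual review."""
    pairs = list(zip(old_outputs, new_outputs))
    rng = random.Random(seed)  # fixed seed: everyone reviews the same sample
    return rng.sample(pairs, min(n, len(pairs)))
```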
The workflow: write prompt → build golden set → automate evaluation → iterate on prompt → run regression test → deploy → monitor in production. That cycle is slower than "write prompt → ship" but it's the only one that catches problems before users do.