Promptfoo Tutorial — Test Your LLM Prompts Before They Break in Production

You tweak a system prompt to improve tone. The change looks fine in your test conversation. You deploy. Three days later a user emails you because the output is now sometimes a numbered list instead of JSON — and your downstream parser is silently swallowing malformed data.

This is one of the most common failure modes in LLM applications. Prompts aren't like regular code: they're fuzzy, and changes that look harmless can shift model behavior in ways that only show up on certain inputs. Without a test suite, you're flying blind every time you touch a prompt.

Promptfoo fixes this. It's an open-source CLI and library for evaluating LLM prompts against test cases — like Jest, but for your prompts. You define inputs, assertions, and providers. Promptfoo runs every combination and tells you what passed and what broke. It's the closest thing to a regression test suite for prompt engineering.

Getting started with Promptfoo

Install it globally or run with npx:

npm install -g promptfoo
# or skip install and use npx
npx promptfoo@latest init

init scaffolds a promptfooconfig.yaml in your current directory. That file is the center of everything in Promptfoo testing.

Here's a minimal config:

prompts:
  - "Summarize the following text in 3 bullet points:\n\n{{text}}"

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiKey: ${ANTHROPIC_API_KEY}

tests:
  - vars:
      text: "The Eiffel Tower was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair..."
    assert:
      - type: contains
        value: "1889"
      - type: llm-rubric
        value: "Response contains exactly 3 bullet points and stays under 100 words"
      - type: not-contains
        value: "I cannot"

Run it:

promptfoo eval

Promptfoo calls the provider for each test case, checks every assertion, and prints a pass/fail table to your terminal. Then run promptfoo view to open a browser UI with a detailed comparison grid.
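
To make that loop concrete, here is a rough sketch of what "call the provider for each test case and check every assertion" amounts to. It is not Promptfoo's implementation: `fakeProvider`, `checkAssertion`, and `runEval` are hypothetical names, the provider is a stub that returns canned text instead of calling a model, and only the two string assertions from the minimal config are handled.

```javascript
// Hypothetical stand-in for a model call: returns canned text instead of hitting an API.
function fakeProvider(prompt) {
  return "- Built 1887 to 1889\n- Entrance arch for the 1889 World's Fair\n- Located in Paris";
}

// Supports only the two string assertions used in the minimal config.
function checkAssertion(assertion, output) {
  switch (assertion.type) {
    case "contains":
      return output.includes(assertion.value);
    case "not-contains":
      return !output.includes(assertion.value);
    default:
      throw new Error(`Unsupported assertion type in this sketch: ${assertion.type}`);
  }
}

// Fill {{var}} placeholders, call the provider, and evaluate every assertion.
function runEval(promptTemplate, tests, provider) {
  return tests.map((test) => {
    const prompt = promptTemplate.replace(/\{\{(\w+)\}\}/g, (_, name) => test.vars[name]);
    const output = provider(prompt);
    const failures = test.assert.filter((a) => !checkAssertion(a, output));
    return { pass: failures.length === 0, failures, output };
  });
}

const results = runEval(
  "Summarize the following text in 3 bullet points:\n\n{{text}}",
  [
    {
      vars: { text: "The Eiffel Tower was built between 1887 and 1889..." },
      assert: [
        { type: "contains", value: "1889" },
        { type: "not-contains", value: "I cannot" },
      ],
    },
  ],
  fakeProvider
);

console.log(results.map((r) => (r.pass ? "PASS" : "FAIL")).join("\n"));
```

A test case passes only when its list of failed assertions is empty, which is why one stray phrase is enough to turn a row red.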

Comparing multiple models simultaneously

One of Promptfoo's best features is multi-provider evaluation. You add multiple providers and every test runs against all of them in parallel:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiKey: ${ANTHROPIC_API_KEY}
  - openai:gpt-4o
  - anthropic:messages:claude-haiku-4-5

Now a single promptfoo eval run shows you how Claude Sonnet, GPT-4o, and Claude Haiku handle every test case side by side. This is how you find the cheapest model that still passes all your quality checks — run the evals, look at which models fail, and eliminate them.

I've used this exact workflow to justify switching from GPT-4o to Claude Haiku on a classification task that ran millions of times per month. Haiku passed 97% of the same tests at roughly 10x lower cost. Without the eval, I would have kept assuming GPT-4o was necessary.
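
The selection step described above can be sketched in a few lines. This assumes you have already summarized an eval run into per-model pass counts and an average cost per call. The `summaries` shape, the model names, and every number below are made up for illustration and are not Promptfoo output.

```javascript
// Hypothetical per-model summary; names and numbers are invented for illustration.
const summaries = [
  { model: "model-large", passed: 100, total: 100, costPerCall: 0.01 },
  { model: "model-medium", passed: 97, total: 100, costPerCall: 0.003 },
  { model: "model-small", passed: 81, total: 100, costPerCall: 0.001 },
];

// Return the cheapest model whose pass rate meets the bar, or null if none does.
function cheapestPassing(candidates, minPassRate) {
  const eligible = candidates.filter((s) => s.passed / s.total >= minPassRate);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, s) => (s.costPerCall < best.costPerCall ? s : best));
}

const choice = cheapestPassing(summaries, 0.95);
console.log(choice ? choice.model : "no model meets the bar");
```

The pass-rate bar is the real decision here: set it to 1.0 if every test must pass, or lower if some tests are known to be flaky or low-stakes.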

Promptfoo assertion types

The assertion system is what makes Promptfoo actually useful. Here's what you'll reach for most often:

contains / not-contains

The simplest checks. Did the output include (or exclude) this exact string?

assert:
  - type: contains
    value: "```json"
  - type: not-contains
    value: "I'm sorry"

Good for format requirements and banned phrases.
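
One limitation worth seeing in code: a contains check confirms that the fence marker is present, not that the payload inside it parses. The sketch below shows a stricter check you could adapt. `parseFencedJson` and `isValidFencedJson` are hypothetical helper names, and the fence marker is built with `repeat` only to keep the listing readable.

```javascript
// Three backticks, built programmatically so this listing stays readable.
const FENCE = "`".repeat(3);

// Extract the first fenced json block and parse it; throws if missing or malformed.
function parseFencedJson(output) {
  const pattern = new RegExp(FENCE + "json\\s*([\\s\\S]*?)" + FENCE);
  const match = output.match(pattern);
  if (!match) throw new Error("No fenced json block found in output");
  return JSON.parse(match[1]);
}

// Boolean wrapper, convenient for a pass/fail check.
function isValidFencedJson(output) {
  try {
    parseFencedJson(output);
    return true;
  } catch {
    return false;
  }
}

const good = "Here you go:\n" + FENCE + 'json\n{"status": "ok"}\n' + FENCE;
const truncated = FENCE + 'json\n{"status": \n' + FENCE;
const numberedList = "1. status: ok";

console.log(isValidFencedJson(good), isValidFencedJson(truncated), isValidFencedJson(numberedList));
```

The truncated sample would satisfy a plain contains check on the fence marker, yet it fails here because the payload is not valid JSON.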

regex

When you need pattern matching instead of exact strings:

assert:
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"  # ISO date format
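
It helps to try the pattern in plain JavaScript before putting it in a config. The sketch below assumes the assertion applies the pattern as an ordinary, unanchored JavaScript regular expression, which is an assumption about the tool rather than something shown above. The doubled backslashes in the YAML string unescape to single ones, just as they do in a JavaScript string literal.

```javascript
// "\\d{4}-\\d{2}-\\d{2}" unescapes to the pattern \d{4}-\d{2}-\d{2}.
const isoDate = new RegExp("\\d{4}-\\d{2}-\\d{2}");

// Unanchored: matches a date anywhere in the text.
const foundInSentence = isoDate.test("Shipped on 2024-03-15.");
const missingDate = isoDate.test("Shipped on March 15.");

// Shape only: the pattern does not check that the date actually exists.
const impossibleDate = isoDate.test("2024-99-99");

// Anchored variant: the whole output must be exactly one date.
const anchored = new RegExp("^\\d{4}-\\d{2}-\\d{2}$");
const anchoredSentence = anchored.test("Shipped on 2024-03-15.");
const anchoredExact = anchored.test("2024-03-15");

console.log({ foundInSentence, missingDate, impossibleDate, anchoredSentence, anchoredExact });
```

If the output must be nothing but a date, anchor the pattern. If the date has to be a real calendar date, a regex is the wrong tool and a javascript assertion is a better fit.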

javascript

Arbitrary JS function that receives output (the model's response as a string) and returns true or false:

assert:
  - type: javascript
    value: "output.startsWith('-') || output.startsWith('•') || output.startsWith('*')"

You can also return an object with { pass: boolean, score: number, reason: string } for weighted scoring.
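
As a sketch of what such a scored result might look like, the function below grades a response by how many of its lines are bullets. `gradeBullets` is a hypothetical helper, and the scoring rule (fraction of expected bullets found) is one arbitrary choice among many. Check Promptfoo's documentation for how to wire a function like this into a javascript assertion.

```javascript
// Score a response by how many non-empty lines start with a bullet marker.
// pass requires exactly expectedCount bullets and no non-bullet lines.
function gradeBullets(output, expectedCount = 3) {
  const lines = output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const bullets = lines.filter((line) => /^[-•*]\s/.test(line));
  const score = Math.min(bullets.length, expectedCount) / expectedCount;
  return {
    pass: bullets.length === expectedCount && bullets.length === lines.length,
    score,
    reason: `Found ${bullets.length} bullet line(s) among ${lines.length} non-empty line(s); expected ${expectedCount}`,
  };
}

console.log(gradeBullets("- one\n- two\n- three"));
console.log(gradeBullets("Here is a summary:\n- one\n- two"));
```

The reason string is worth the extra line: when a test fails in a results table, it tells you what went wrong without rerunning anything.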

llm-rubric

This is the most powerful assertion type for subjective quality. Promptfoo sends your output plus a rubric to an LLM (GPT-4o by default, configurable) and asks it to judge whether the output passes:

assert:
  - type: llm-rubric
    value: "The response is professional in tone, avoids jargon, and is under 150 words"

The judge model returns pass/fail with a reason. You'll use this for things that are genuinely hard to check with string matching — tone, completeness, conciseness, factual accuracy against a reference.

similar

Semantic similarity against an expected output. Useful when you care about meaning but not exact wording:

assert:
  - type: similar
    value: "The customer should receive a full refund within 3-5 business days"
    threshold: 0.8
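
To give the threshold some intuition: similarity checks of this kind typically embed both texts as vectors and compare them with cosine similarity. The sketch below shows only the comparison step on toy three-dimensional vectors. Real embeddings come from an embedding model and have hundreds or thousands of dimensions, and the vectors here are invented.

```javascript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for embeddings (invented for illustration).
const expected = [1, 2, 3];
const paraphrase = [1, 2, 3.5];
const unrelated = [3, -1, 0];
const threshold = 0.8;

console.log(cosineSimilarity(expected, paraphrase) >= threshold);
console.log(cosineSimilarity(expected, unrelated) >= threshold);
```

A threshold is a tuning knob, not a constant: run a handful of known-good and known-bad outputs and pick a value that separates them.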

cost and latency

Non-functional requirements:

assert:
  - type: cost
    threshold: 0.002  # Under $0.002 per call
  - type: latency
    threshold: 3000   # Under 3 seconds

These are useful when you're trying to keep a high-traffic feature under budget.
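
When picking a cost threshold, it helps to do the arithmetic once by hand. The sketch below estimates per-call cost from token counts. The prices are hypothetical placeholders, not any provider's real rates, and `estimateCost` is an illustrative helper.

```javascript
// Hypothetical prices in dollars per million tokens; substitute your provider's real rates.
const PRICE_PER_MILLION = { input: 3.0, output: 15.0 };

// Cost of one call = input tokens at the input rate + output tokens at the output rate.
function estimateCost(inputTokens, outputTokens, prices = PRICE_PER_MILLION) {
  return (inputTokens * prices.input + outputTokens * prices.output) / 1e6;
}

const budget = 0.002; // same threshold as the assertion above

const shortReply = estimateCost(400, 50);
const longReply = estimateCost(400, 100);

console.log(shortReply <= budget, longReply <= budget);
```

With these placeholder rates, output tokens cost five times as much as input tokens, so doubling the reply length is enough to push the second call over the budget.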

Reading external test data

For anything beyond toy examples, you'll want to load test inputs from files rather than hardcoding them in the YAML:

tests:
  - vars:
      text: file://test-inputs/article-1.txt
    assert:
      - type: contains
        value: "1889"

  - vars:
      text: file://test-inputs/article-2.txt
    assert:
      - type: llm-rubric
        value: "Response contains exactly 3 bullet points"

You can also load entire test suites from CSV or JSON files, which makes it easy to build up a library of regression cases over time.

LLM-as-judge: the most important eval pattern

When you're evaluating anything subjective (tone, helpfulness, accuracy, safety), string matching misses failures that an LLM judge can catch.

The pattern: your output goes to a judge model along with a rubric. The judge returns a structured pass/fail verdict. Promptfoo handles all of this automatically when you use llm-rubric.

defaultTest:
  options:
    provider: openai:gpt-4o  # Use GPT-4o as the judge

tests:
  - vars:
      user_query: "How do I cancel my subscription?"
    assert:
      - type: llm-rubric
        value: "Response correctly explains the cancellation process, is empathetic in tone, and doesn't mention competitors"

The catch: LLM-as-judge adds cost and latency to your eval runs. Use it selectively — for your highest-value assertions, not as a replacement for simple string checks.
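
For readers who want to see the moving parts, the judge pattern can be sketched as two small functions: one builds the judge prompt from the rubric and the output, the other parses a structured verdict. This is an illustration of the pattern, not Promptfoo's internal prompt or parser. `buildJudgePrompt`, `parseVerdict`, and `stubJudge` are hypothetical names, and the stub returns a canned reply instead of calling a model.

```javascript
// Build the text sent to the judge: rubric plus the output under review.
function buildJudgePrompt(rubric, output) {
  return [
    "You are grading a model response against a rubric.",
    `Rubric: ${rubric}`,
    `Response: ${output}`,
    'Reply with JSON only: {"pass": true or false, "reason": "..."}',
  ].join("\n");
}

// Parse the judge's reply; anything unparseable counts as a failure, never a pass.
function parseVerdict(reply) {
  try {
    const verdict = JSON.parse(reply);
    if (typeof verdict.pass !== "boolean") throw new Error("missing boolean 'pass'");
    return { pass: verdict.pass, reason: String(verdict.reason ?? "") };
  } catch (err) {
    return { pass: false, reason: `Unparseable judge reply: ${err.message}` };
  }
}

// Hypothetical stub judge: a real one would send the prompt to a model.
function stubJudge(prompt) {
  return '{"pass": true, "reason": "Explains the cancellation steps in a polite tone"}';
}

const judgePrompt = buildJudgePrompt(
  "Response correctly explains the cancellation process and is empathetic in tone",
  "Sorry to see you go. You can cancel under Settings > Billing."
);
console.log(parseVerdict(stubJudge(judgePrompt)));
console.log(parseVerdict("Sure! Looks good to me."));
```

Failing closed on an unparseable reply matters: a judge that rambles instead of returning JSON should never be counted as a pass.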

For a deeper look at evaluation methodology, see my post on LLM evaluation frameworks.

CI integration with GitHub Actions

Running evals manually before every deploy is too slow and easy to skip. The right setup runs evals automatically on pull requests.

name: Prompt eval

on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run prompt evals
        run: npx promptfoo eval --ci
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The --ci flag makes Promptfoo exit with a non-zero code if any assertions fail, which blocks the merge. The paths filter scopes the trigger so the workflow only runs when prompt files change. Without it, every PR kicks off an eval that burns API budget on unrelated code changes.

This workflow is what prompt versioning and production management look like in practice: your prompts live in version control, and changes get tested before they reach users.

Red-teaming with auto-generated adversarial inputs

Promptfoo can generate adversarial test cases automatically:

promptfoo generate dataset --purpose "customer support chatbot for a SaaS product"

This uses an LLM to generate inputs designed to elicit failures — prompt injections, out-of-scope requests, attempts to get the model to reveal its system prompt, jailbreak attempts. The generated cases get added to your test suite.

Red-teaming is especially useful before shipping something that handles user-generated input. A few minutes of automated adversarial testing will surface edge cases your manual tests missed.

Promptfoo vs LangSmith vs Braintrust

These tools get mentioned together but they solve different problems:

Promptfoo is open source, runs from the CLI, and is built around offline evaluation against test cases. No account is required, eval results stay on your machine by default (your prompts still go to whichever model providers you call), and it integrates naturally into CI. Best fit: teams that want automated regression testing on prompt changes without adding infrastructure.

LangSmith is built into the LangChain ecosystem. Its strength is production tracing — you instrument your app and LangSmith captures every LLM call with inputs, outputs, latency, and cost. You can then build evals on top of that trace data. Best fit: LangChain users who want to trace production behavior and run evals against real traffic.

Braintrust is aimed at teams with annotation workflows. It has a UI for human reviewers to label model outputs, and it handles dataset management, experiment tracking, and A/B testing at the team level. Best fit: organizations that need structured human evaluation alongside automated evals, or that are running systematic A/B testing on prompts in production.

For most developer-built LLM applications, Promptfoo is the right starting point. It's free, fast to set up, and the CI integration catches regressions before they ship. You can add LangSmith for production tracing later without replacing Promptfoo.

A complete promptfooconfig.yaml example

Here's what a real config looks like for a product description generator — the kind of thing that runs in a production pipeline and needs to stay reliable:

prompts:
  - file://prompts/product-description-v3.txt

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiKey: ${ANTHROPIC_API_KEY}
      max_tokens: 512
  - id: anthropic:messages:claude-haiku-4-5
    config:
      apiKey: ${ANTHROPIC_API_KEY}
      max_tokens: 512

defaultTest:
  assert:
    - type: not-contains
      value: "I cannot"
    - type: not-contains
      value: "As an AI"
    - type: latency
      threshold: 5000

tests:
  - vars:
      product_name: "Noise-cancelling headphones"
      key_features: "40hr battery, ANC, USB-C charging"
      target_audience: "remote workers"
    assert:
      - type: contains
        value: "noise-cancelling"
      - type: llm-rubric
        value: "Description is 50-80 words, mentions battery life, and includes a clear benefit statement"
      - type: cost
        threshold: 0.003

  - vars:
      product_name: "Standing desk converter"
      key_features: "adjustable height, fits standard desks, 25kg capacity"
      target_audience: "office workers"
    assert:
      - type: javascript
        value: "output.length >= 100 && output.length <= 500"
      - type: llm-rubric
        value: "Description targets the benefits of standing work without making unsubstantiated health claims"

The defaultTest.assert block applies to every test case, which saves you from repeating the same basic checks everywhere.
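
The effect can be pictured as a simple merge: each test ends up being checked against the shared assertions plus its own. The sketch below illustrates that outcome only. `effectiveAssertions` is a hypothetical helper, and the ordering and merge details inside Promptfoo are assumptions here.

```javascript
// Shared assertions, copied from the defaultTest block in the config above.
const defaultTest = {
  assert: [
    { type: "not-contains", value: "I cannot" },
    { type: "not-contains", value: "As an AI" },
    { type: "latency", threshold: 5000 },
  ],
};

// The second test case from the config above, trimmed to its assertions.
const deskTest = {
  assert: [
    { type: "javascript", value: "output.length >= 100 && output.length <= 500" },
    {
      type: "llm-rubric",
      value: "Description targets the benefits of standing work without making unsubstantiated health claims",
    },
  ],
};

// Combine shared assertions with a test's own assertions (shared ones first).
function effectiveAssertions(defaults, test) {
  return [...(defaults.assert ?? []), ...(test.assert ?? [])];
}

console.log(effectiveAssertions(defaultTest, deskTest).map((a) => a.type));
```

Under that reading, the standing desk test is held to five checks even though it declares only two.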

What Promptfoo won't tell you

Promptfoo tests prompts against test cases you've written. If your test cases don't represent your real input distribution, passing evals give you false confidence. The test suite is only as good as the edge cases you've thought to include.

The solution is to pull real production inputs into your test set. Sample 50-100 actual user inputs, run them through promptfoo eval, review the outputs in the browser UI, and manually label which are good and which are bad. Then turn each bad one into a test case with an assertion that captures what went wrong, so it fails until the prompt is fixed. Now your eval suite is anchored in real failure modes instead of hypothetical ones.
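
The conversion step is mechanical enough to script. The sketch below assumes a hypothetical review format in which each sample carries the raw input, a good/bad label, and, for bad samples, a reviewer's note describing what a correct response must do. It emits objects in the same vars/assert shape used by the configs in this post. The sample data is invented.

```javascript
// Hypothetical reviewed samples; the inputs, labels, and notes are invented.
const labeled = [
  { input: "How do I cancel my subscription?", label: "good", note: "" },
  { input: "refund???", label: "bad", note: "Response states the refund timeline and next step" },
  { input: "ur app deleted my data", label: "bad", note: "Response apologizes and points to data recovery options" },
];

// Turn each bad sample into a regression test, using the note as an llm-rubric.
function toRegressionTests(samples, varName) {
  return samples
    .filter((sample) => sample.label === "bad")
    .map((sample) => ({
      vars: { [varName]: sample.input },
      assert: [{ type: "llm-rubric", value: sample.note }],
    }));
}

const regressionTests = toRegressionTests(labeled, "user_query");
console.log(JSON.stringify(regressionTests, null, 2));
```

Writing the note as a positive requirement, rather than a description of the bug, is what makes it usable as a rubric.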

For a broader framework on how to think about this — what to measure and why — see the AI agent production checklist.

Getting started today

If you have a prompt that runs in production and you haven't tested it: start with five test cases. Just five. Pick inputs that represent the most common user requests and the most likely failure modes. Add contains assertions for format requirements and one llm-rubric assertion for quality.

npx promptfoo@latest init
# edit promptfooconfig.yaml
npx promptfoo@latest eval
npx promptfoo@latest view

The whole setup takes under 30 minutes. After that, every prompt change you make is testable. That's the difference between shipping with confidence and shipping and hoping.
