You tweak a system prompt to improve tone. The change looks fine in your test conversation. You deploy. Three days later a user emails you because the output is now sometimes a numbered list instead of JSON — and your downstream parser is silently swallowing malformed data.
This is one of the most common failure modes in LLM applications. Prompts aren't like regular code: they're fuzzy, and changes that look harmless can shift model behavior in ways that only show up on certain inputs. Without a test suite, you're flying blind every time you touch a prompt.
Promptfoo fixes this. It's an open-source CLI and library for evaluating LLM prompts against test cases — like Jest, but for your prompts. You define inputs, assertions, and providers. Promptfoo runs every combination and tells you what passed and what broke. It's the closest thing to a regression test suite for prompt engineering.
Getting started with Promptfoo
Install it globally or run with npx:
npm install -g promptfoo
# or skip install and use npx
npx promptfoo@latest init
init scaffolds a promptfooconfig.yaml in your current directory. That file is the center of everything in Promptfoo testing.
Here's a minimal config:
prompts:
  - "Summarize the following text in 3 bullet points:\n\n{{text}}"
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiKey: ${ANTHROPIC_API_KEY}
tests:
  - vars:
      text: "The Eiffel Tower was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair..."
    assert:
      - type: contains
        value: "1889"
      - type: llm-rubric
        value: "Response contains exactly 3 bullet points and stays under 100 words"
      - type: not-contains
        value: "I cannot"
Run it:
promptfoo eval
Promptfoo calls the provider for each test case, checks every assertion, and prints a pass/fail table to your terminal. Then run promptfoo view to open a browser UI with a detailed comparison grid.
Comparing multiple models simultaneously
One of Promptfoo's best features is multi-provider evaluation. You add multiple providers and every test runs against all of them in parallel:
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiKey: ${ANTHROPIC_API_KEY}
  - openai:gpt-4o
  - anthropic:messages:claude-haiku-4-5
Now a single promptfoo eval run shows you how Claude Sonnet, GPT-4o, and Claude Haiku handle every test case side by side. This is how you find the cheapest model that still passes all your quality checks — run the evals, look at which models fail, and eliminate them.
I've used this exact workflow to justify switching from GPT-4o to Claude Haiku on a classification task that ran millions of times per month. Haiku passed 97% of the same tests at roughly 10x lower cost. Without the eval, I would have kept assuming GPT-4o was necessary.
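The selection step can be sketched in a few lines. This is a minimal illustration, not something Promptfoo produces for you: the model names, pass counts, and per-call costs are made-up values, and cheapestPassing is a hypothetical helper you would feed with numbers read off your own eval run.

```javascript
// Hypothetical per-model summaries, e.g. copied by hand from an eval run.
// Names, pass counts, and costs are made-up illustration values.
const summaries = [
  { model: "model-large", passed: 100, total: 100, costPerCall: 0.01 },
  { model: "model-medium", passed: 97, total: 100, costPerCall: 0.003 },
  { model: "model-small", passed: 81, total: 100, costPerCall: 0.0005 },
];

// Return the cheapest model whose pass rate meets the bar, or null if none does.
function cheapestPassing(models, minPassRate) {
  const qualifying = models.filter((m) => m.passed / m.total >= minPassRate);
  if (qualifying.length === 0) return null;
  return qualifying.reduce((best, m) => (m.costPerCall < best.costPerCall ? m : best));
}

const choice = cheapestPassing(summaries, 0.95);
console.log(choice ? choice.model : "no model meets the bar");
```

Choosing the pass-rate bar is the real decision here. A 95% bar accepts the mid-sized model in this toy data, while a 100% bar forces the most expensive one.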
Promptfoo assertion types
The assertion system is what makes Promptfoo actually useful. Here's what you'll reach for most often:
contains / not-contains
The simplest checks. Did the output include (or exclude) this exact string?
assert:
  - type: contains
    value: "```json"
  - type: not-contains
    value: "I'm sorry"
Good for format requirements and banned phrases.
regex
When you need pattern matching instead of exact strings:
assert:
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}" # ISO date format
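It helps to try a pattern in plain JavaScript before putting it in the config. The sketch below assumes the regex assertion behaves like an unanchored RegExp test, which is my reading rather than a documented guarantee; containsIsoDate and isOnlyIsoDate are hypothetical helper names.

```javascript
// The YAML string "\\d{4}-\\d{2}-\\d{2}" unescapes to the pattern \d{4}-\d{2}-\d{2}.
const isoDate = /\d{4}-\d{2}-\d{2}/;

// Unanchored: true if the pattern appears anywhere in the output.
function containsIsoDate(output) {
  return isoDate.test(output);
}

// Anchored variant for when the whole output must be a date and nothing else.
function isOnlyIsoDate(output) {
  return /^\d{4}-\d{2}-\d{2}$/.test(output.trim());
}

console.log(containsIsoDate("Shipped on 2024-03-15."));
console.log(isOnlyIsoDate("Shipped on 2024-03-15."));
```

Note that the pattern checks shape, not validity: a string like 2024-99-99 still matches. If you need real date validation, a javascript assertion is a better fit.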
javascript
An arbitrary JavaScript expression or function body that receives output (the model's response as a string) and returns true or false:
assert:
  - type: javascript
    value: "output.startsWith('-') || output.startsWith('•') || output.startsWith('*')"
You can also return an object with { pass: boolean, score: number, reason: string } for weighted scoring.
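Here is a rough sketch of what a graded check might look like. The function name gradeBullets and the expectedBullets parameter are my own inventions for illustration; only the { pass, score, reason } return shape comes from the description above.

```javascript
// Graded bullet check: score drops as the bullet count drifts from the target.
// `expectedBullets` is a hypothetical parameter, not a Promptfoo option.
function gradeBullets(output, expectedBullets = 3) {
  const lines = output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  // A bullet is a line starting with -, • or * followed by whitespace,
  // so a line of **bold** text is not counted.
  const bullets = lines.filter((line) => /^[-•*]\s+/.test(line));
  const distance = Math.abs(bullets.length - expectedBullets);
  const score = Math.max(0, 1 - distance / expectedBullets);
  const pass = distance === 0;
  return {
    pass,
    score,
    reason: pass
      ? `Found ${expectedBullets} bullets`
      : `Expected ${expectedBullets} bullets, found ${bullets.length}`,
  };
}

console.log(gradeBullets("- one\n- two\n- three"));
console.log(gradeBullets("- one\n- two"));
```

A partial score is useful when you compare prompt variants: two prompts can both fail a strict check while one is clearly closer to the target format.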
llm-rubric
This is the most powerful assertion type for subjective quality. Promptfoo sends your output plus a rubric to an LLM (an OpenAI model by default; the exact default depends on your Promptfoo version, and it is configurable) and asks it to judge whether the output passes:
assert:
  - type: llm-rubric
    value: "The response is professional in tone, avoids jargon, and is under 150 words"
The judge model returns pass/fail with a reason. You'll use this for things that are genuinely hard to check with string matching — tone, completeness, conciseness, factual accuracy against a reference.
similar
Semantic similarity against an expected output. Useful when you care about meaning but not exact wording:
assert:
  - type: similar
    value: "The customer should receive a full refund within 3-5 business days"
    threshold: 0.8
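To make the threshold less abstract: my understanding is that this kind of check embeds both texts and compares the vectors with cosine similarity. The sketch below shows only that comparison, using toy 3-dimensional vectors I made up; real embeddings come from an embedding model and have far more dimensions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" with made-up values, for illustration only.
const expectedVec = [0.9, 0.1, 0.3];
const closeVec = [0.8, 0.2, 0.35];
const unrelatedVec = [0.05, 0.95, 0.1];

const similarityThreshold = 0.8;
console.log(cosineSimilarity(expectedVec, closeVec) >= similarityThreshold);
console.log(cosineSimilarity(expectedVec, unrelatedVec) >= similarityThreshold);
```

The right threshold depends on the embedding model, so treat 0.8 as a starting point and tune it against outputs you have already judged by hand.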
cost and latency
Non-functional requirements:
assert:
  - type: cost
    threshold: 0.002 # Under $0.002 per call
  - type: latency
    threshold: 3000 # Under 3 seconds
These are useful when you're trying to keep a high-traffic feature under budget.
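A quick back-of-the-envelope calculation helps when picking a cost threshold. The prices below are hypothetical placeholders, not any provider's real rates, and callCost is a helper name I made up for the sketch.

```javascript
// Hypothetical prices in USD per million tokens; substitute your provider's real rates.
const PRICE_PER_MTOK = { input: 1.0, output: 5.0 };

// Cost of one call given its token counts.
function callCost(inputTokens, outputTokens, price = PRICE_PER_MTOK) {
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}

// Example: 800 prompt tokens and 200 completion tokens at the rates above.
const cost = callCost(800, 200);
const costThreshold = 0.002;
console.log(cost.toFixed(4), cost <= costThreshold ? "within budget" : "over budget");
```

Output tokens usually dominate at these ratios, so a prompt change that makes responses longer can push a call over the threshold even when the input is unchanged.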
Reading external test data
For anything beyond toy examples, you'll want to load test inputs from files rather than hardcoding them in the YAML:
tests:
  - vars:
      text: file://test-inputs/article-1.txt
    assert:
      - type: contains
        value: "1889"
  - vars:
      text: file://test-inputs/article-2.txt
    assert:
      - type: llm-rubric
        value: "Response contains exactly 3 bullet points"
You can also load entire test suites from CSV or JSON files, which makes it easy to build up a library of regression cases over time.
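If your regression cases already live somewhere structured, a small script can turn them into test cases. This sketch assumes a JSON test file uses the same { vars, assert } shape as the YAML above; the record fields (text, mustContain) and the toTestCases helper are hypothetical.

```javascript
// Hypothetical regression records, e.g. collected from bug reports.
const records = [
  { text: "The Eiffel Tower was built between 1887 and 1889...", mustContain: "1889" },
  { text: "Mount Everest is 8,849 metres tall...", mustContain: "8,849" },
];

// Map each record onto the { vars, assert } shape used in the YAML config.
function toTestCases(rows) {
  return rows.map((row) => ({
    vars: { text: row.text },
    assert: [{ type: "contains", value: row.mustContain }],
  }));
}

// Serialise for writing to a test file; where you save it is up to you.
const testCasesJson = JSON.stringify(toTestCases(records), null, 2);
console.log(testCasesJson);
```

Generating the file from records keeps the regression library easy to extend: adding a case means adding one record, not hand-editing nested config.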
LLM-as-judge: the most important eval pattern
When you're evaluating anything subjective (tone, helpfulness, accuracy, safety), string matching will miss failures that an LLM judge can catch.
The pattern: your output goes to a judge model along with a rubric. The judge returns a structured pass/fail verdict. Promptfoo handles all of this automatically when you use llm-rubric.
defaultTest:
  options:
    provider: openai:gpt-4o # Use GPT-4o as the judge
tests:
  - vars:
      user_query: "How do I cancel my subscription?"
    assert:
      - type: llm-rubric
        value: "Response correctly explains the cancellation process, is empathetic in tone, and doesn't mention competitors"
The catch: LLM-as-judge adds cost and latency to your eval runs. Use it selectively — for your highest-value assertions, not as a replacement for simple string checks.
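If you want a feel for what the judge pattern involves under the hood, here is a stripped-down sketch. It is not Promptfoo's actual grading prompt or parser; the prompt wording and both function names are assumptions made for illustration, and the call to the judge model itself is left out.

```javascript
// Build the message sent to the judge model.
// The wording is illustrative, not Promptfoo's actual grading prompt.
function buildJudgePrompt(output, rubric) {
  return [
    "You are grading a model response against a rubric.",
    `Rubric: ${rubric}`,
    `Response: ${output}`,
    'Reply with JSON only: {"pass": true or false, "reason": "<one sentence>"}',
  ].join("\n");
}

// Parse the judge's reply defensively: anything malformed counts as a failure.
function parseVerdict(raw) {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.pass !== "boolean") {
      return { pass: false, reason: "Judge reply missing boolean 'pass'" };
    }
    return { pass: parsed.pass, reason: String(parsed.reason ?? "") };
  } catch {
    return { pass: false, reason: "Judge reply was not valid JSON" };
  }
}

console.log(parseVerdict('{"pass": true, "reason": "Polite and on topic"}'));
console.log(parseVerdict("Sure! I think it passes."));
```

Treating an unparseable verdict as a failure is a deliberate choice: a judge that goes off-format should make noise in your eval rather than pass silently.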
For a deeper look at evaluation methodology, see my post on LLM evaluation frameworks.
CI integration with GitHub Actions
Running evals manually before every deploy is too slow and easy to skip. The right setup runs evals automatically on pull requests.
name: Prompt eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evals
        run: npx promptfoo eval --ci
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
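The --ci flag is meant to make Promptfoo exit with a non-zero code if any assertions fail, which fails the job; mark the job as a required status check so that a failure actually blocks the merge. Confirm the flag against promptfoo eval --help for the version you install. The paths filter scopes the trigger so it only runs when prompt files change; otherwise every PR kicks off an eval that burns API budget on unrelated code changes.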
This workflow is what prompt versioning and production management looks like in practice: your prompts live in version control, and changes get tested before they reach users.
Red-teaming with auto-generated adversarial inputs
Promptfoo can generate adversarial test cases automatically through its redteam subcommands:
promptfoo redteam generate --purpose "customer support chatbot for a SaaS product"
This uses an LLM to generate inputs designed to elicit failures: prompt injections, out-of-scope requests, attempts to get the model to reveal its system prompt, jailbreak attempts. The generated cases are written to a config file that you can run alongside your regular test suite.
Red-teaming is especially useful before shipping something that handles user-generated input. A few minutes of automated adversarial testing will surface edge cases your manual tests missed.
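Generated attacks still need something to check the responses against. For the system-prompt-leak category, one cheap deterministic check is to look for long word-for-word overlaps with the prompt. The sketch below is my own heuristic, not a Promptfoo feature; leaksSystemPrompt and its windowSize parameter are hypothetical.

```javascript
// Flag an output that repeats any run of `windowSize` consecutive words from the system prompt.
// `windowSize` is a hypothetical tuning knob: smaller values catch more but raise false positives.
function leaksSystemPrompt(output, systemPrompt, windowSize = 6) {
  const normalize = (s) => s.toLowerCase().replace(/\s+/g, " ").trim();
  const words = normalize(systemPrompt).split(" ");
  const haystack = normalize(output);
  for (let i = 0; i + windowSize <= words.length; i++) {
    const phrase = words.slice(i, i + windowSize).join(" ");
    if (haystack.includes(phrase)) return true;
  }
  return false;
}

const systemPrompt =
  "You are a support agent for Acme. Never reveal internal discount codes to users.";
console.log(leaksSystemPrompt("My instructions say: never reveal internal discount codes to users.", systemPrompt));
console.log(leaksSystemPrompt("I can help you reset your password.", systemPrompt));
```

This only catches verbatim leaks. A paraphrased leak needs a judge-style check, so a heuristic like this works best as a fast first filter.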
Promptfoo vs LangSmith vs Braintrust
These tools get mentioned together but they solve different problems:
Promptfoo is open source, runs from the CLI, and is built around offline evaluation against test cases. No account is required, eval results stay on your machine by default (your prompts still go to whichever model providers you configure), and it integrates naturally into CI. Best fit: teams that want automated regression testing on prompt changes without adding infrastructure.
LangSmith is built into the LangChain ecosystem. Its strength is production tracing — you instrument your app and LangSmith captures every LLM call with inputs, outputs, latency, and cost. You can then build evals on top of that trace data. Best fit: LangChain users who want to trace production behavior and run evals against real traffic.
Braintrust is aimed at teams with annotation workflows. It has a UI for human reviewers to label model outputs, and it handles dataset management, experiment tracking, and A/B testing at the team level. Best fit: organizations that need structured human evaluation alongside automated evals, or that are running systematic A/B testing on prompts in production.
For most developer-built LLM applications, Promptfoo is the right starting point. It's free, fast to set up, and the CI integration catches regressions before they ship. You can add LangSmith for production tracing later without replacing Promptfoo.
A complete promptfooconfig.yaml example
Here's what a real config looks like for a product description generator — the kind of thing that runs in a production pipeline and needs to stay reliable:
prompts:
  - file://prompts/product-description-v3.txt
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      apiKey: ${ANTHROPIC_API_KEY}
      max_tokens: 512
  - id: anthropic:messages:claude-haiku-4-5
    config:
      apiKey: ${ANTHROPIC_API_KEY}
      max_tokens: 512
defaultTest:
  assert:
    - type: not-contains
      value: "I cannot"
    - type: not-contains
      value: "As an AI"
    - type: latency
      threshold: 5000
tests:
  - vars:
      product_name: "Noise-cancelling headphones"
      key_features: "40hr battery, ANC, USB-C charging"
      target_audience: "remote workers"
    assert:
      - type: contains
        value: "noise-cancelling"
      - type: llm-rubric
        value: "Description is 50-80 words, mentions battery life, and includes a clear benefit statement"
      - type: cost
        threshold: 0.003
  - vars:
      product_name: "Standing desk converter"
      key_features: "adjustable height, fits standard desks, 25kg capacity"
      target_audience: "office workers"
    assert:
      - type: javascript
        value: "output.length >= 100 && output.length <= 500"
      - type: llm-rubric
        value: "Description targets the benefits of standing work without making unsubstantiated health claims"
The defaultTest.assert block applies to every test case, which saves you from repeating the same basic checks everywhere.
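Conceptually, the effect is a merge of the shared assertions into each test. The sketch below illustrates that idea only; it is not Promptfoo's implementation, applyDefaults is a hypothetical helper, and putting the defaults first is my own assumption about ordering.

```javascript
// Illustrative merge: each test's effective assertions are the defaults plus its own.
// The original test objects are left unmodified.
function applyDefaults(defaultTest, tests) {
  const defaults = defaultTest.assert ?? [];
  return tests.map((t) => ({ ...t, assert: [...defaults, ...(t.assert ?? [])] }));
}

const defaultTest = {
  assert: [
    { type: "not-contains", value: "I cannot" },
    { type: "latency", threshold: 5000 },
  ],
};
const tests = [
  {
    vars: { product_name: "Standing desk converter" },
    assert: [{ type: "javascript", value: "output.length >= 100" }],
  },
  { vars: { product_name: "Noise-cancelling headphones" } },
];

const effective = applyDefaults(defaultTest, tests);
console.log(effective.map((t) => t.assert.length));
```

The second test has no assertions of its own, yet it still gets the two shared checks. That is the practical benefit: a new test case is never completely unguarded.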
What Promptfoo won't tell you
Promptfoo tests prompts against test cases you've written. If your test cases don't represent your real input distribution, passing evals give you false confidence. The test suite is only as good as the edge cases you've thought to include.
The solution is to pull real production inputs into your test set. Sample 50-100 actual user inputs, run them through promptfoo eval, review the outputs in the browser UI, manually label which ones are good and bad, then turn the bad ones into regression test cases. Now your eval suite is anchored in real failure modes instead of hypothetical ones.
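For the sampling step, a uniform random draw avoids the bias of grabbing the most recent or most memorable inputs. This is a minimal sketch under the assumption that your logged inputs fit in an array; sampleInputs and the loggedInputs data are hypothetical stand-ins for your own storage.

```javascript
// Draw up to n items uniformly at random without replacement (partial Fisher-Yates shuffle).
// `rng` is injectable so a run can be made reproducible; it must return a number in [0, 1).
function sampleInputs(items, n, rng = Math.random) {
  const pool = items.slice(); // copy so the caller's array is untouched
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Hypothetical log of user inputs; in practice this would come from your own storage.
const loggedInputs = Array.from({ length: 1000 }, (_, i) => `user input #${i}`);
const batch = sampleInputs(loggedInputs, 50);
console.log(batch.length);
```

Uniform sampling reflects your common traffic well but can miss rare failure modes, so it is worth topping up the batch with any inputs that have already caused incidents.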
For a broader framework on how to think about this — what to measure and why — see the AI agent production checklist.
Getting started today
If you have a prompt that runs in production and you haven't tested it: start with five test cases. Just five. Pick inputs that represent the most common user requests and the most likely failure modes. Add contains assertions for format requirements and one llm-rubric assertion for quality.
npx promptfoo@latest init
# edit promptfooconfig.yaml
promptfoo eval
promptfoo view
The whole setup takes under 30 minutes. After that, every prompt change you make is testable. That's the difference between shipping with confidence and shipping and hoping.



