Automatic Prompt Engineer (APE) flips the script on prompt writing: instead of you crafting and testing prompts by hand, you describe the task and let an LLM search for the best instruction automatically.
## The Core Idea
Human prompt engineering is time-consuming and inconsistent. Two engineers writing prompts for the same task often get very different results — and neither one is likely to be optimal.
APE turns prompt optimization into a search problem:
- Generate candidate prompts using an LLM
- Score each candidate on a test set
- Select the highest-scoring prompt
- Optionally iterate — use the best prompt as a starting point for the next generation
## The APE Pipeline

```
Task description + examples
        ↓
[Generation LLM] → Candidate prompts (20–50 options)
        ↓
[Score each on eval set]
        ↓
Ranked candidates
        ↓
Best prompt → [Optionally iterate]
```
## Step 1: Generating Candidate Prompts

You give the generation model examples of the task and ask it to invent instructions that would produce the correct outputs:

```
I have a task where given a customer review, I need to extract the main complaint.

Examples:
Input: "The battery died after 4 hours. Completely unusable for travel."
Output: "Short battery life"

Input: "Delivery took 3 weeks and the package arrived damaged."
Output: "Slow shipping, damaged delivery"

Input: "The interface is confusing and I can't find basic settings."
Output: "Confusing interface"

Generate 10 different instruction prompts that would make an AI reliably perform
this task. Write each as a clear instruction to the AI model.
```
Sample generated candidates:
- "Extract the core complaint from this customer review in 3-5 words."
- "Identify the main problem the customer experienced. Be specific and concise."
- "Read this review and state the primary issue the customer faced."
- "What is the customer's biggest complaint? Answer in a brief phrase."
## Step 2: Scoring Candidates

Each candidate prompt is tested against a held-out evaluation set:

```python
def score_prompt(prompt: str, eval_set: list[dict]) -> float:
    """Score a prompt on a labeled evaluation set."""
    correct = 0
    for example in eval_set:
        response = llm.generate(f"{prompt}\n\nReview: {example['input']}")
        if is_correct(response, example['expected_output']):
            correct += 1
    return correct / len(eval_set)

# Score all candidates
scored = [(prompt, score_prompt(prompt, eval_set)) for prompt in candidates]
scored.sort(key=lambda x: x[1], reverse=True)
best_prompt = scored[0][0]
```
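The `is_correct` check is left undefined above, and its semantics depend on your task. For short-phrase outputs like the complaint-extraction example, a lenient normalized comparison is one possible stand-in (an assumption, not a general-purpose grader):

```python
def is_correct(response: str, expected: str) -> bool:
    """Lenient match for short phrases: lowercase, collapse whitespace,
    strip surrounding quotes/periods, then accept exact or containing matches."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split()).strip(' ."\'')
    r, e = norm(response), norm(expected)
    return bool(r) and (r == e or e in r)
```

For stricter tasks, replace this with exact label matching or an LLM-as-judge call; loose matching can inflate scores.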
## Step 3: Iterative Refinement

The best prompt from round 1 becomes the seed for round 2:

```
The following prompt achieved 72% accuracy on the task:

"Extract the core complaint from this customer review in 3-5 words."

Generate 10 variations of this prompt that might perform better.
Try: different specificity levels, different framing, explicit examples, format instructions.
```
This is analogous to evolution: generate variants, select the best, generate variants from that, repeat.
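This refinement step can be wrapped in a helper. The `REFINEMENT_PROMPT` wording and the `generate_variations` name below are assumptions (the latter matches the call in the pipeline code later); the LLM client is passed in explicitly here as a text-in/text-out `generate` callable for testability:

```python
import re

# Assumed template -- adjust wording and scores to your task.
REFINEMENT_PROMPT = """The following prompts scored best so far:
{prompts}

Generate {n} variations of these prompts that might perform better.
Try different specificity levels, framing, and format instructions.
Number each variation on its own line."""

def generate_variations(best_prompts: list[str], n: int, generate) -> list[str]:
    """Ask the model (via `generate`) for n refined variants of the best prompts."""
    text = generate(REFINEMENT_PROMPT.format(
        prompts="\n".join(f'- "{p}"' for p in best_prompts), n=n
    ))
    # Pull prompt text out of the numbered response ("1. ..." or "1) ...")
    return [m.group(1)
            for m in (re.match(r"\s*\d+[.)]\s+(.*\S)", line)
                      for line in text.splitlines())
            if m]
```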
## When APE Is Worth Using
APE shines when:
| Scenario | Why APE helps |
|---|---|
| High-stakes, high-volume task | Even small accuracy gains compound at scale |
| Task performance is objectively measurable | You can score candidates reliably |
| You have labeled evaluation examples | Required for scoring |
| Current prompt performance is plateauing | Human iteration has diminishing returns |
| Deploying across diverse users | APE finds more robust instructions |
Not worth it for:
- One-off or low-volume tasks
- Creative tasks with no objective metric
- When you have no labeled evaluation data
## Practical APE with Modern LLMs
A minimal APE implementation for a classification task:
```python
GENERATION_PROMPT = """
You are a prompt engineer. Given examples of a task, generate {n} different
instruction prompts that an AI model should follow to complete the task.

Task examples:
{examples}

Generate {n} candidate instruction prompts. Write each on a new line, numbered.
"""

def run_ape(task_examples: list, eval_set: list, n_candidates: int = 20, iterations: int = 2):
    examples_str = format_examples(task_examples)

    # Round 1: generate candidates from the task description
    response = llm.generate(GENERATION_PROMPT.format(
        n=n_candidates, examples=examples_str
    ))
    candidates = parse_numbered_list(response)

    for iteration in range(iterations):
        # Score all candidates
        scored = [(p, score_prompt(p, eval_set)) for p in candidates]
        scored.sort(key=lambda x: x[1], reverse=True)
        print(f"Iteration {iteration + 1}: Best score = {scored[0][1]:.2%}")
        print(f"Best prompt: {scored[0][0]}")

        # Seed the next round with the top-3 candidates -- but skip this after
        # the final round, since newly generated variations would never be scored
        if iteration < iterations - 1:
            best_prompts = [p for p, _ in scored[:3]]
            candidates = generate_variations(best_prompts, n_candidates)

    return scored[0][0]  # Best prompt from the last scored round
```
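The `format_examples` helper is assumed above; a minimal version that renders input/output pairs in the style of the Step 1 prompt:

```python
def format_examples(examples: list[dict]) -> str:
    """Render input/output pairs in the Input:/Output: style used in Step 1."""
    return "\n\n".join(
        f'Input: "{ex["input"]}"\nOutput: "{ex["output"]}"' for ex in examples
    )
```

The exact `"input"`/`"output"` key names are an assumption; match them to however your task examples are stored.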
## APE vs. Manual Prompt Engineering
| Dimension | Manual | APE |
|---|---|---|
| Time to first good prompt | Minutes | Hours (including eval setup) |
| Exploration breadth | Limited by human creativity | 20–50+ candidates per iteration |
| Consistency | Variable | Reproducible |
| Requires labeled data | No | Yes |
| Best for | Quick prototyping | High-stakes production tasks |
Use manual prompting to get a baseline. Use APE to optimize for production when you have the evaluation infrastructure.
## Key Takeaways
- APE generates many candidate prompts, scores them on a test set, and picks the winner
- The generation prompt asks an LLM to "invent instructions that produce the right output"
- Iterative refinement (generate → score → generate from best) improves results significantly
- Requires a labeled evaluation set — without it, you can't score candidates objectively
- Most valuable for high-volume production tasks where even 5% accuracy gains matter