Automatic Prompt Engineer (APE) flips the script on prompt writing: instead of you crafting and testing prompts by hand, you describe the task and let an LLM search for the best instruction automatically.
## The Core Idea
Human prompt engineering is time-consuming and inconsistent. Two engineers writing prompts for the same task often get very different results — and neither one is likely to be optimal.
APE turns prompt optimization into a search problem:
- Generate candidate prompts using an LLM
- Score each candidate on a test set
- Select the highest-scoring prompt
- Optionally iterate — use the best prompt as a starting point for the next generation
## The APE Pipeline

```
Task description + examples
        ↓
[Generation LLM] → Candidate prompts (20–50 options)
        ↓
[Score each on eval set]
        ↓
Ranked candidates
        ↓
Best prompt → [Optionally iterate]
```
## Step 1: Generating Candidate Prompts

You give the generation model examples of the task and ask it to invent instructions that would produce the correct outputs:

```
I have a task where given a customer review, I need to extract the main complaint.

Examples:
Input: "The battery died after 4 hours. Completely unusable for travel."
Output: "Short battery life"

Input: "Delivery took 3 weeks and the package arrived damaged."
Output: "Slow shipping, damaged delivery"

Input: "The interface is confusing and I can't find basic settings."
Output: "Confusing interface"

Generate 10 different instruction prompts that would make an AI reliably perform
this task. Write each as a clear instruction to the AI model.
```
Sample generated candidates:
- "Extract the core complaint from this customer review in 3-5 words."
- "Identify the main problem the customer experienced. Be specific and concise."
- "Read this review and state the primary issue the customer faced."
- "What is the customer's biggest complaint? Answer in a brief phrase."
## Step 2: Scoring Candidates

Each candidate prompt is tested against a held-out evaluation set:

```python
def score_prompt(prompt: str, eval_set: list[dict]) -> float:
    """Score a prompt on a labeled evaluation set."""
    correct = 0
    for example in eval_set:
        response = llm.generate(f"{prompt}\n\nReview: {example['input']}")
        if is_correct(response, example['expected_output']):
            correct += 1
    return correct / len(eval_set)

# Score all candidates
scored = [(prompt, score_prompt(prompt, eval_set)) for prompt in candidates]
scored.sort(key=lambda x: x[1], reverse=True)
best_prompt = scored[0][0]
```
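The `is_correct` check is left undefined above, and its semantics depend on your task. For short-phrase outputs like the complaint-extraction example, a lenient normalized comparison is one possible stand-in (an assumption, not a general-purpose grader):

```python
def is_correct(response: str, expected: str) -> bool:
    """Lenient match for short phrases: lowercase, collapse whitespace,
    strip surrounding quotes/periods, then accept exact or containing matches."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split()).strip(' ."\'')
    r, e = norm(response), norm(expected)
    return bool(r) and (r == e or e in r)
```

For stricter tasks, replace this with exact label matching or an LLM-as-judge call; loose matching can inflate scores.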
## Step 3: Iterative Refinement

The best prompt from round 1 becomes the seed for round 2:

```
The following prompt achieved 72% accuracy on the task:

"Extract the core complaint from this customer review in 3-5 words."

Generate 10 variations of this prompt that might perform better.
Try: different specificity levels, different framing, explicit examples, format instructions.
```
This is analogous to evolution: generate variants, select the best, generate variants from that, repeat.
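This refinement step can be wrapped in a helper. The `REFINEMENT_PROMPT` wording and the `generate_variations` name below are assumptions (the latter matches the call in the pipeline code later); the LLM client is passed in explicitly here as a text-in/text-out `generate` callable for testability:

```python
import re

# Assumed template -- adjust wording and scores to your task.
REFINEMENT_PROMPT = """The following prompts scored best so far:
{prompts}

Generate {n} variations of these prompts that might perform better.
Try different specificity levels, framing, and format instructions.
Number each variation on its own line."""

def generate_variations(best_prompts: list[str], n: int, generate) -> list[str]:
    """Ask the model (via `generate`) for n refined variants of the best prompts."""
    text = generate(REFINEMENT_PROMPT.format(
        prompts="\n".join(f'- "{p}"' for p in best_prompts), n=n
    ))
    # Pull prompt text out of the numbered response ("1. ..." or "1) ...")
    return [m.group(1)
            for m in (re.match(r"\s*\d+[.)]\s+(.*\S)", line)
                      for line in text.splitlines())
            if m]
```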
## When APE Is Worth Using
APE shines when:
| Scenario | Why APE helps |
|---|---|
| High-stakes, high-volume task | Even small accuracy gains compound at scale |
| Task performance is objectively measurable | You can score candidates reliably |
| You have labeled evaluation examples | Required for scoring |
| Current prompt performance is plateauing | Human iteration has diminishing returns |
| Deploying across diverse users | APE finds more robust instructions |
Not worth it for:
- One-off or low-volume tasks
- Creative tasks with no objective metric
- When you have no labeled evaluation data
## Practical APE with Modern LLMs
A minimal APE implementation for a classification task:
```python
GENERATION_PROMPT = """
You are a prompt engineer. Given examples of a task, generate {n} different
instruction prompts that an AI model should follow to complete the task.

Task examples:
{examples}

Generate {n} candidate instruction prompts. Write each on a new line, numbered.
"""

def run_ape(task_examples: list, eval_set: list, n_candidates: int = 20, iterations: int = 2):
    examples_str = format_examples(task_examples)

    # Round 1: generate candidates from the task description
    response = llm.generate(GENERATION_PROMPT.format(
        n=n_candidates, examples=examples_str
    ))
    candidates = parse_numbered_list(response)

    for iteration in range(iterations):
        # Score all candidates
        scored = [(p, score_prompt(p, eval_set)) for p in candidates]
        scored.sort(key=lambda x: x[1], reverse=True)
        print(f"Iteration {iteration + 1}: Best score = {scored[0][1]:.2%}")
        print(f"Best prompt: {scored[0][0]}")

        # Seed the next round with the top-3 candidates -- but skip this after
        # the final round, since newly generated variations would never be scored
        if iteration < iterations - 1:
            best_prompts = [p for p, _ in scored[:3]]
            candidates = generate_variations(best_prompts, n_candidates)

    return scored[0][0]  # Best prompt from the last scored round
```
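The `format_examples` helper is assumed above; a minimal version that renders input/output pairs in the style of the Step 1 prompt:

```python
def format_examples(examples: list[dict]) -> str:
    """Render input/output pairs in the Input:/Output: style used in Step 1."""
    return "\n\n".join(
        f'Input: "{ex["input"]}"\nOutput: "{ex["output"]}"' for ex in examples
    )
```

The exact `"input"`/`"output"` key names are an assumption; match them to however your task examples are stored.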
## APE vs. Manual Prompt Engineering
| Dimension | Manual | APE |
|---|---|---|
| Time to first good prompt | Minutes | Hours (including eval setup) |
| Exploration breadth | Limited by human creativity | 20–50+ candidates per iteration |
| Consistency | Variable | Reproducible |
| Requires labeled data | No | Yes |
| Best for | Quick prototyping | High-stakes production tasks |
Use manual prompting to get a baseline. Use APE to optimize for production when you have the evaluation infrastructure.
## Key Takeaways
- APE generates many candidate prompts, scores them on a test set, and picks the winner
- The generation prompt asks an LLM to "invent instructions that produce the right output"
- Iterative refinement (generate → score → generate from best) improves results significantly
- Requires a labeled evaluation set — without it, you can't score candidates objectively
- Most valuable for high-volume production tasks where even 5% accuracy gains matter