What is self-consistency prompting?

Self-consistency is a technique where you send the same prompt to an LLM multiple times (usually 5–20 times), sample diverse chain-of-thought reasoning paths, and select the most common final answer by majority vote. It was introduced by Wang et al. (2022), who reported that voting over 40 sampled paths raised GSM8K accuracy from 56.5% to 74.4% compared with single-sample chain-of-thought.

Why does sampling multiple responses improve accuracy?

LLMs sample their output: with non-zero temperature, the same prompt produces different reasoning paths. Some paths make errors; others reach the correct answer. Incorrect paths tend to scatter across different wrong answers while correct paths agree with each other, so the majority answer is more likely to be right than any single sample. In effect, the vote approximates the answer the model considers most probable across all of its reasoning paths.

How many samples do I need for self-consistency?

Research shows significant gains from 5 samples, with diminishing returns after 20–40. For most practical use cases, 5–10 samples strike a good balance between accuracy improvement and API cost. If the task is extremely high-stakes, go higher. If answers start converging quickly, you don't need more samples.

When should I NOT use self-consistency?

Self-consistency multiplies your token cost by the number of samples. Skip it for: simple tasks where CoT already achieves high accuracy, latency-sensitive applications, high-volume low-stakes requests, and tasks with no single 'correct' answer (creative writing, open-ended advice). Use it for high-stakes, low-volume reasoning tasks where accuracy matters most.

Self-Consistency: Get Better Answers by Sampling Multiple Reasoning Paths

Self-consistency is a simple idea with surprisingly large accuracy gains: instead of trusting a single chain-of-thought response, generate many and vote on the answer.

The Core Insight

Chain-of-thought prompting works by having the model reason before answering. But a single reasoning chain can go wrong — one misstep early on propagates to a wrong conclusion.

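Self-consistency fixes this by sampling many reasoning paths and aggregating their final answers. Paths that reach the correct answer tend to agree with each other, while paths that go wrong tend to land on different wrong answers, so the correct answer usually collects the most votes.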

                    Path 1: Reasoning... → Answer: 42
                    Path 2: Reasoning... → Answer: 42
Prompt → LLM × N → Path 3: Reasoning... → Answer: 41  → Majority vote → 42
                    Path 4: Reasoning... → Answer: 42
                    Path 5: Reasoning... → Answer: 42
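
A quick simulation can make the voting argument concrete. The sketch below is illustrative only: it assumes each path independently reaches the correct answer with probability p_correct and otherwise picks one of a few wrong answers uniformly at random. Real model errors are often correlated, so actual gains will differ. The function name and all parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_vote_accuracy(n_paths: int, p_correct: float = 0.6,
                           n_wrong_answers: int = 4, trials: int = 2000,
                           seed: int = 0) -> float:
    """Estimate how often a majority vote over n_paths picks the correct answer.

    Assumption (illustration only): paths are independent, and a wrong path
    picks uniformly among n_wrong_answers distinct wrong answers.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = []
        for _ in range(n_paths):
            if rng.random() < p_correct:
                votes.append("correct")
            else:
                votes.append(f"wrong_{rng.randrange(n_wrong_answers)}")
        # Ties resolve to the answer that appeared first in this trial.
        winner = Counter(votes).most_common(1)[0][0]
        hits += winner == "correct"
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(n, simulate_vote_accuracy(n))
```

Under these assumptions, the estimated accuracy rises as n_paths grows, because the wrong votes are split across several answers while the correct votes pile up on one.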

How to Implement Self-Consistency

Self-consistency doesn't require a special prompt format — you use a standard CoT prompt and call the API multiple times.

Step 1: Write a chain-of-thought prompt

Q: A store had 120 apples. They sold 35 and received a delivery of 80 more.
How many apples do they have now?
Let's work through this step by step.

Step 2: Set temperature > 0 (typically 0.5–1.0)

At zero temperature, decoding is (nearly) deterministic, so every sample follows essentially the same reasoning path and the vote adds nothing.

Step 3: Sample N times (5–20 is practical)

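# `llm` stands in for your model client; `prompt` is the CoT prompt from Step 1.
responses = []
for _ in range(10):
    # Each call samples an independent reasoning path.
    response = llm.generate(prompt, temperature=0.7)
    responses.append(response)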
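
Because the samples are independent, they can also be requested concurrently, which keeps wall-clock latency closer to that of a single call. The following is a minimal sketch using Python's standard thread pool. It assumes a hypothetical generate callable that wraps one model call at temperature > 0 and returns the response text; sample_paths and max_workers are illustrative names, and your provider's rate limits may constrain how many workers are sensible.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def sample_paths(generate: Callable[[str], str], prompt: str, n: int = 10,
                 max_workers: int = 5) -> List[str]:
    """Call generate(prompt) n times concurrently and return all responses."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate, prompt) for _ in range(n)]
        # result() re-raises any exception raised inside the worker thread.
        return [future.result() for future in futures]
```

With the placeholder client from the loop above, this could be called as sample_paths(lambda p: llm.generate(p, temperature=0.7), prompt).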

Step 4: Extract and aggregate final answers

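from collections import Counter

# extract_final_answer is a task-specific parser you supply.
answers = [extract_final_answer(r) for r in responses]
# most_common(1) returns [(answer, count)]; ties go to the answer seen first.
majority_answer = Counter(answers).most_common(1)[0][0]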
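
For numeric tasks such as the apple problem in Step 1, one possible extract_final_answer takes the last number in the response and normalizes it so that equivalent spellings vote together. This is a sketch under that assumption, not the only reasonable parser: it will pick up the wrong value if the model mentions another number after its answer, so asking the model to end with a fixed "Answer: ..." line makes it more reliable.

```python
import re
from typing import Optional

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number in the response as a normalized string, or None."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Normalize so "165", "165.0" and "1,250" style variants compare equal.
    return str(int(value)) if value.is_integer() else str(value)
```

Responses that yield None are best dropped before voting so that parse failures cannot win the majority.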

Accuracy Improvements

Research from Wang et al. (2022) tested self-consistency on mathematical reasoning benchmarks:

Task	Standard CoT	Self-Consistency (40 paths)
GSM8K (math word problems)	56.5%	74.4%
SVAMP (math)	68.9%	82.4%
AQuA (algebra)	47.2%	65.4%

The gains come without any prompt changes — just sampling more and voting.

When Self-Consistency Works Best

Self-consistency is most valuable when:

1. There's a single objectively correct answer. Math problems, logic puzzles, factual lookups, code that either runs or doesn't. Majority voting only helps when there's a right answer to converge on.

2. The reasoning difficulty varies. If the model always gets a task right or always gets it wrong, sampling more doesn't help much. Self-consistency shines when some reasoning paths succeed and others fail, that is, when the model has "partial knowledge" of the correct approach.

3. You have headroom on cost and latency. Self-consistency costs N× more than a single call. Use it for high-stakes, low-volume tasks.
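
The N× cost in point 3 is simple to estimate up front. The sketch below assumes every sample is a separate request that resends the full prompt; the function name and the token counts are made-up illustrative values, not measurements.

```python
def estimate_total_tokens(prompt_tokens: int, completion_tokens: int, n_samples: int) -> int:
    """Total tokens processed when one prompt is sampled n_samples times."""
    # Assumes each sample is its own request, so the prompt is counted every time.
    return n_samples * (prompt_tokens + completion_tokens)


single = estimate_total_tokens(200, 300, n_samples=1)   # 500 tokens
voted = estimate_total_tokens(200, 300, n_samples=10)   # 5000 tokens
```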

Practical Self-Consistency Pattern

For production use, you don't need elaborate infrastructure. A simple majority vote works:

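from collections import Counter

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def self_consistent_answer(prompt: str, n: int = 10, temperature: float = 0.7) -> str:
    """Generate n responses and return the majority answer."""
    answers = []

    for _ in range(n):
        # Each call samples an independent reasoning path.
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        )
        # extract_answer is the task-specific parser described below.
        answer = extract_answer(response.content[0].text)
        answers.append(answer)

    # Majority vote over the extracted final answers.
    return Counter(answers).most_common(1)[0][0]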

extract_answer is task-specific — you might parse the final number from a math problem, extract a category label, or take the last sentence.
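
Two of the strategies just mentioned, category labels and last sentences, could be sketched as follows. Both helpers are hypothetical examples rather than part of any library, and the right choice depends on how your prompt asks the model to state its answer.

```python
import re
from typing import Optional, Sequence


def extract_label(text: str, labels: Sequence[str]) -> Optional[str]:
    """Return whichever allowed label is mentioned last in the text, or None."""
    best_label, best_pos = None, -1
    for label in labels:
        pattern = rf"\b{re.escape(label)}\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            if match.start() > best_pos:
                best_label, best_pos = label, match.start()
    return best_label


def extract_last_sentence(text: str) -> str:
    """Return the final sentence, lowercased and without trailing punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [part.strip() for part in parts if part.strip()]
    return sentences[-1].lower().rstrip(".!?") if sentences else ""
```

Free-text extractors like extract_last_sentence only vote well when the model phrases its conclusion consistently, so a constrained answer format is usually the safer design.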

Adaptive Self-Consistency

A cost-saving optimization: start with a few samples and only generate more if answers don't converge.

Sample 3 responses → All agree → Return answer (done, low cost)
Sample 3 responses → 2/3 agree → Sample 2 more → 4/5 agree → Return answer
Sample 3 responses → All different → Sample 7 more → Take majority of 10

This reduces average cost because easy questions stop after a few samples, while harder questions still get a larger sample budget.
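
The 3, 5, 10 schedule above can be sketched as a small function. It assumes a hypothetical zero-argument callable, sample_answer, that makes one model call at temperature > 0 and returns the already-extracted final answer. As a simplification, the sketch returns the majority of whatever it has collected after the top-up rather than checking agreement a second time.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def adaptive_self_consistency(sample_answer: Callable[[], Hashable]) -> Tuple[Hashable, int]:
    """Return (majority_answer, samples_used) using a 3 -> 5 -> 10 schedule."""
    answers: List[Hashable] = [sample_answer() for _ in range(3)]
    top_count = Counter(answers).most_common(1)[0][1]

    if top_count == 3:
        extra = 0   # unanimous: stop early
    elif top_count == 2:
        extra = 2   # partial agreement: top up to 5 samples
    else:
        extra = 7   # no agreement: top up to 10 samples

    answers.extend(sample_answer() for _ in range(extra))
    return Counter(answers).most_common(1)[0][0], len(answers)
```

Returning the number of samples used makes it easy to log how often the early exit fires and what the average cost per question turns out to be.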

Self-Consistency vs. Similar Techniques

Technique	What it does	Best for
Self-consistency	Sample many chains, vote	Reasoning with a right answer
Chain of Thought	Show reasoning in one pass	Structured single-response reasoning
Tree of Thought	Explore branching reasoning trees	Complex planning, search problems
Ensemble prompting	Different prompts, same question	Reducing prompt sensitivity

Self-consistency and Tree of Thought are complementary: ToT explores reasoning space deliberately, while self-consistency samples it stochastically.

Key Takeaways

Generate multiple CoT responses with temperature > 0
Aggregate by majority vote on the final answer
Works best for tasks with single correct answers
5–10 samples give most of the benefit
Not worth it for simple tasks or creative/open-ended output