Self-consistency is a simple idea with surprisingly large accuracy gains: instead of trusting a single chain-of-thought response, generate many and vote on the answer.
The Core Insight
Chain-of-thought prompting works by having the model reason before answering. But a single reasoning chain can go wrong — one misstep early on propagates to a wrong conclusion.
Self-consistency mitigates this by sampling many reasoning paths and aggregating their final answers. Correct paths tend to converge on the same answer, while flawed paths tend to scatter across different wrong answers, so the correct answer usually collects the most votes.
                   Path 1: Reasoning... → Answer: 42
                   Path 2: Reasoning... → Answer: 42
Prompt → LLM × N → Path 3: Reasoning... → Answer: 41 → Majority vote → 42
                   Path 4: Reasoning... → Answer: 42
                   Path 5: Reasoning... → Answer: 42
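To get a feel for why voting helps, here is a minimal sketch under a deliberately simplified model that is an assumption for illustration, not something the source states: each sampled path is independently correct with probability `p`, and every wrong path lands on the same wrong answer (the worst case for voting). The function name `majority_correct_prob` is hypothetical.

```python
from math import comb

def majority_correct_prob(p: float, n: int) -> float:
    """P(a strict majority of n independent samples is correct), each correct w.p. p."""
    k_min = n // 2 + 1  # smallest number of correct samples that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Compare a single sample with small and medium-sized votes at p = 0.6
for n in (1, 5, 15):
    print(n, round(majority_correct_prob(0.6, n), 3))
```

Real samples from one model are correlated, so actual gains can be smaller than this model suggests. The same model also shows the flip side: when `p` is below 0.5, voting makes the wrong answer more likely to win.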
How to Implement Self-Consistency
Self-consistency doesn't require a special prompt format — you use a standard CoT prompt and call the API multiple times.
Step 1: Write a chain-of-thought prompt
Q: A store had 120 apples. They sold 35 and received a delivery of 80 more.
How many apples do they have now?
Let's work through this step by step.
Step 2: Set temperature > 0 (typically 0.5–1.0)
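At temperature 0, decoding is (near-)deterministic, so every sample follows essentially the same reasoning path and the vote adds nothing. A temperature above 0 is what makes the sampled paths differ.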
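The effect of temperature can be illustrated with a small sketch of temperature-scaled softmax over token scores. The logits below are made-up values, and `softmax_with_temperature` is a hypothetical helper, not part of any LLM client.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate next tokens
for t in (1.0, 0.7, 0.1):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature shrinks, nearly all probability mass moves to the top-scoring token, which is why low-temperature samples rarely diverge. This helper cannot take a temperature of exactly 0 (it would divide by zero); conceptually, that limit corresponds to always picking the top token.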
Step 3: Sample N times (5–20 is practical)
responses = []
for _ in range(10):
    response = llm.generate(prompt, temperature=0.7)
    responses.append(response)
Step 4: Extract and aggregate final answers
from collections import Counter
answers = [extract_final_answer(r) for r in responses]
majority_answer = Counter(answers).most_common(1)[0][0]
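For the apple problem, `extract_final_answer` could be sketched as below. This is one possible heuristic, assuming the final answer is the last number in the response; the sample responses are invented for illustration. Normalizing the number matters, because otherwise "165" and "165.0" would be counted as different votes.

```python
import re
from collections import Counter
from typing import Optional

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number in the response as a normalized string, or None."""
    matches = NUMBER.findall(response)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))  # drop thousands separators
    # Normalize so "165" and "165.0" count as the same vote
    return str(int(value)) if value.is_integer() else str(value)

# Invented sample responses, including one arithmetic slip and one non-answer
responses = [
    "120 - 35 = 85, and 85 + 80 = 165. The answer is 165.",
    "They sold 35, leaving 85. Adding 80 gives 165 apples.",
    "120 + 80 = 200, minus 35 is 165.0",
    "120 - 35 = 95, then 95 + 80 = 175. The answer is 175.",
    "I am not sure.",
]
answers = [extract_final_answer(r) for r in responses]
valid = [a for a in answers if a is not None]  # ignore responses with no parsable answer
majority_answer = Counter(valid).most_common(1)[0][0]
print(majority_answer)
```

Dropping unparsable responses before voting keeps a failed extraction from competing with real answers.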
Accuracy Improvements
Wang et al. (2022) evaluated self-consistency on arithmetic and commonsense reasoning benchmarks. Their reported results for PaLM 540B on three math benchmarks:
| Task | Standard CoT | Self-Consistency (40 paths) |
|---|---|---|
| GSM8K (math word problems) | 56.5% | 74.4% |
| SVAMP (math) | 79.0% | 86.6% |
| AQuA (algebra) | 35.8% | 48.3% |
The gains come without any prompt changes — just sampling more and voting.
When Self-Consistency Works Best
Self-consistency is most valuable when:
1. There's a single objectively correct answer. Examples: math problems, logic puzzles, factual lookups, code that either runs or doesn't. Majority voting only helps when there's a right answer to converge on.
2. The model's reasoning succeeds only some of the time. If a task is always easy or always hard, sampling more doesn't help much. Self-consistency shines when some reasoning paths succeed and others fail, that is, where the model has "partial knowledge" about the correct approach.
3. You have headroom on cost and latency. Self-consistency costs roughly N× as much as a single call, so reserve it for high-stakes, low-volume tasks.
Practical Self-Consistency Pattern
For production use, you don't need elaborate infrastructure. A simple majority vote works:
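from collections import Counter

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def self_consistent_answer(prompt: str, n: int = 10, temperature: float = 0.7) -> str:
    """Generate n responses and return the majority answer."""
    answers = []
    for _ in range(n):
        # Each call samples one independent reasoning path
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = extract_answer(response.content[0].text)
        answers.append(answer)
    # On a tie, Counter returns the answer that was seen first
    return Counter(answers).most_common(1)[0][0]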
extract_answer is task-specific — you might parse the final number from a math problem, extract a category label, or take the last sentence.
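For a category-label task, one possible `extract_answer` is sketched below. It assumes the prompt instructs the model to end with a line of the form `Final answer: <label>`; that marker convention is an assumption for this example, not something the technique requires.

```python
import re
from typing import Optional

FINAL_LINE = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)

def extract_answer(text: str) -> Optional[str]:
    """Return the normalized text after the last 'Final answer:' marker, or None."""
    matches = FINAL_LINE.findall(text)
    if not matches:
        return None  # the model did not follow the requested format
    # Lowercase and strip trailing punctuation so "Spam." and "spam" vote together
    return matches[-1].strip().rstrip(".!").lower()

print(extract_answer("The message pushes a fake prize.\nFinal answer: Spam."))
```

Whatever extractor you use, normalize its output (case, whitespace, punctuation), since the vote compares answers by exact equality.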
Adaptive Self-Consistency
A cost-saving optimization: start with a few samples and only generate more if answers don't converge.
Sample 3 responses → All agree → Return answer (done, low cost)
Sample 3 responses → 2/3 agree → Sample 2 more → 4/5 agree → Return answer
Sample 3 responses → All different → Sample 7 more → Take majority of 10
This reduces average cost: easy questions stop after a few samples, while hard questions still get the full sample budget.
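The adaptive scheme can be sketched as a loop with a stopping rule. The version below is one possible reading of the flow above, not a definitive implementation: it stops once the leading answer holds at least `threshold` of the votes or the budget is spent. The parameter names and defaults are assumptions, and `sample_answer` is a hypothetical callable that makes one LLM call and returns the extracted answer.

```python
from collections import Counter
from typing import Callable, Tuple

def adaptive_self_consistency(
    sample_answer: Callable[[], str],
    initial: int = 3,
    step: int = 2,
    max_samples: int = 10,
    threshold: float = 0.8,
) -> Tuple[str, int]:
    """Sample until the leading answer has >= threshold of the votes or the budget runs out.

    Returns (leading answer, number of samples used).
    """
    answers = [sample_answer() for _ in range(initial)]
    while True:
        top_answer, top_count = Counter(answers).most_common(1)[0]
        if top_count / len(answers) >= threshold or len(answers) >= max_samples:
            return top_answer, len(answers)
        # Not converged yet: draw a few more samples without exceeding the budget
        extra = min(step, max_samples - len(answers))
        answers.extend(sample_answer() for _ in range(extra))

# Scripted stand-in for an LLM so the example runs offline
scripted = iter(["42", "41", "42", "42", "42"])
answer, used = adaptive_self_consistency(lambda: next(scripted))
print(answer, used)
```

With these defaults, three matching answers return immediately, and a 2/3 split triggers two more samples and returns once 4 of 5 agree, matching the first two cases above. Full disagreement keeps sampling in steps of two up to the cap of 10.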
Self-Consistency vs. Similar Techniques
| Technique | What it does | Best for |
|---|---|---|
| Self-consistency | Sample many chains, vote | Reasoning with a right answer |
| Chain of Thought | Show reasoning in one pass | Structured single-response reasoning |
| Tree of Thought | Explore branching reasoning trees | Complex planning, search problems |
| Ensemble prompting | Different prompts, same question | Reducing prompt sensitivity |
Self-consistency and Tree of Thought are complementary: ToT explores reasoning space deliberately, while self-consistency samples it stochastically.
Key Takeaways
- Generate multiple CoT responses with temperature > 0
- Aggregate by majority vote on the final answer
- Works best for tasks with single correct answers
- 5–10 samples give most of the benefit
- Not worth it for simple tasks or creative/open-ended output