LLMs are remarkable reasoners but unreliable calculators. Program-Aided Language Models (PAL) solve this by having the model write code instead of computing answers directly, letting the interpreter handle what it does best.
The Problem with LLM Arithmetic
Ask an LLM to solve a multi-step math word problem and it'll often get the reasoning right but the arithmetic wrong:
Q: A train leaves at 9:15 AM and arrives 3 hours 47 minutes later.
The ticket costs $2.50 per stop, and there are 8 stops.
The passenger gets a 15% discount. What is the final price?
LLM answer: "The train arrives at 1:02 PM. The final price is $17.50."
(Correct arrival time, wrong price — arithmetic error in the discount calculation)
The model understands the problem perfectly but makes calculation errors because it's generating tokens, not computing.
The PAL Solution
Instead of computing the answer, write Python that computes it:
Q: A train leaves at 9:15 AM and arrives 3 hours 47 minutes later.
The ticket costs $2.50 per stop, and there are 8 stops.
The passenger gets a 15% discount. What is the final price?
Write Python code to solve this step by step, then execute it.
# Solution
from datetime import datetime, timedelta
# Arrival time
departure = datetime.strptime("9:15 AM", "%I:%M %p")
travel_time = timedelta(hours=3, minutes=47)
arrival = departure + travel_time
print(f"Arrives at: {arrival.strftime('%I:%M %p')}") # 01:02 PM
# Ticket price
price_per_stop = 2.50
stops = 8
base_price = price_per_stop * stops # 20.00
discount = 0.15
final_price = base_price * (1 - discount) # 17.00
print(f"Final price: ${final_price:.2f}") # $17.00
Execution output: Arrives at: 01:02 PM / Final price: $17.00
The code is correct. The interpreter produces the exact answer.
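To make the round trip concrete, here is a minimal sketch of the execution step: run the generated code and capture whatever it prints as the answer. `execute_pal_code` is a hypothetical helper name, and the bare `exec` is for illustration only — real deployments need the sandboxing covered later in this article.

```python
import io
from contextlib import redirect_stdout

def execute_pal_code(code: str) -> str:
    """Run model-generated code and capture its printed output.
    Illustration only -- production use requires a sandbox."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

# The price portion of the generated solution above
generated = """
price_per_stop = 2.50
stops = 8
base_price = price_per_stop * stops
final_price = base_price * (1 - 0.15)
print(f"Final price: ${final_price:.2f}")
"""
print(execute_pal_code(generated))  # Final price: $17.00
```

The printed output becomes the model's answer, so the LLM never has to do the multiplication itself.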
The PAL Prompt Pattern
A PAL prompt typically uses few-shot examples that show the model how to produce code instead of direct answers:
Solve the following math problems by writing Python code. Use comments to explain
your reasoning. Then I'll run the code and give you the output.
Example 1:
Problem: If a store sells 240 items at $12.50 each but gives a 20% discount
to customers who buy more than 10 items, and John bought 15 items, what does
John pay?
# John bought more than 10, so gets the discount
price_per_item = 12.50
items = 15
discount = 0.20
total = items * price_per_item * (1 - discount)
print(f"John pays: ${total:.2f}")
# Output: John pays: $150.00
---
Now solve:
Problem: [Your problem here]
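Wiring this pattern into an application takes two small pieces: assembling the few-shot prompt around a new problem, and pulling the code out of the model's reply. The helper names below (`build_pal_prompt`, `extract_code`) are hypothetical, and the regex assumes the model may wrap its code in a markdown fence:

```python
import re

# Condensed few-shot prefix; in practice include the full worked example(s)
PAL_PREFIX = (
    "Solve the following math problems by writing Python code. "
    "Use comments to explain your reasoning.\n\n"
)

def build_pal_prompt(problem: str) -> str:
    """Append the new problem to the few-shot prefix."""
    return f"{PAL_PREFIX}Problem: {problem}\n"

def extract_code(reply: str) -> str:
    """Pull code out of the model reply, stripping a markdown fence
    if the model added one; otherwise return the reply as-is."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

reply = "```python\ntotal = 15 * 12.50 * 0.8\nprint(total)\n```"
print(extract_code(reply))
```

The extracted string is what gets handed to the (sandboxed) interpreter in the next step.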
PAL for Non-Math Problems
PAL extends beyond arithmetic to any task that benefits from programmatic exactness:
Date and time calculations:
from datetime import date, timedelta
start = date(2026, 3, 1)
days_until_quarter_end = (date(2026, 6, 30) - start).days
print(f"Days until the end of Q2: {days_until_quarter_end}") # 121
String manipulation:
# Count vowels in a sentence
sentence = "The quick brown fox jumps over the lazy dog"
vowels = sum(1 for c in sentence.lower() if c in 'aeiou')
print(f"Vowel count: {vowels}") # 11
Logical operations:
# Which candidates meet all criteria?
candidates = [
{"name": "Alice", "years_exp": 5, "python": True, "remote": True},
{"name": "Bob", "years_exp": 3, "python": True, "remote": False},
{"name": "Carol", "years_exp": 7, "python": False, "remote": True},
]
qualified = [c["name"] for c in candidates
             if c["years_exp"] >= 4 and c["python"] and c["remote"]]
print(qualified) # ['Alice']
Accuracy Comparison
Research from Gao et al. (2022) compared PAL against standard CoT on math benchmarks:
| Benchmark | Standard CoT | PAL |
|---|---|---|
| GSM8K (grade school math) | 56.4% | 72.0% |
| MATH (competition math) | 6.5% | 12.0% |
| AQuA (algebra) | 47.4% | 55.5% |
PAL consistently outperforms pure language generation on tasks involving computation.
Implementing PAL Safely
The critical concern with PAL is code execution safety. Never run LLM-generated code without sandboxing:
import subprocess
import tempfile
import os
def run_pal(code: str, timeout: int = 5) -> str:
    """Execute PAL-generated code in a restricted subprocess."""
    # Write the generated code to a temp file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        tmp_path = f.name
    try:
        result = subprocess.run(
            ["python3", tmp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
            # Restrict the environment the code sees
            env={"PATH": "/usr/bin", "HOME": "/tmp"},
        )
        return result.stdout if result.returncode == 0 else f"Error: {result.stderr}"
    except subprocess.TimeoutExpired:
        return f"Error: execution exceeded {timeout}s timeout"
    finally:
        os.unlink(tmp_path)
Minimum safety requirements:
- Sandboxed subprocess (no network, no filesystem access beyond /tmp)
- Hard timeout (prevent infinite loops)
- Memory limits
- Allowlist of safe imports (block os, subprocess, requests, etc.)
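The import allowlist can be enforced with a static AST check before the code ever runs. This is a sketch with an illustrative allowlist; it complements, but does not replace, the subprocess sandbox, since dangerous behavior is possible without any import at all:

```python
import ast

# Illustrative allowlist -- tune to your use case
ALLOWED_MODULES = {"math", "datetime", "itertools", "json", "re", "collections"}

def disallowed_imports(code: str) -> list:
    """Return top-level modules imported by `code` that are not on
    the allowlist. Static check, run before execution."""
    bad = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        bad.extend(r for r in roots if r not in ALLOWED_MODULES)
    return bad

print(disallowed_imports("import os, math\nfrom subprocess import run"))
# ['os', 'subprocess']
```

Reject the code (or return an error to the model) whenever this list is non-empty, before the subprocess is ever spawned.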
PAL vs. Tool Calling
| Approach | How it works | Best for |
|---|---|---|
| PAL | LLM writes full programs | Complex multi-step computation |
| Calculator tool | LLM calls a pre-built calculator | Simple arithmetic |
| Code interpreter | LLM writes code, sandbox runs it | General computation, data analysis |
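To see the contrast concretely, a calculator tool exposes one narrow, safely-evaluated function per call, whereas PAL hands the model a whole interpreter. A minimal sketch of the calculator side, using an AST walker rather than `eval`:

```python
import ast
import operator

# Calculator-style tool: the LLM emits one arithmetic expression
# per call, and the host evaluates it with a restricted parser.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Safely evaluate a single arithmetic expression."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

print(calc("2.50 * 8 * (1 - 0.15)"))  # 17.0
```

One call, one number back. Multi-step problems then require the model to chain many tool calls and carry intermediate results in its context, which is exactly the bookkeeping a full PAL program avoids.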
Modern AI systems like Claude's code interpreter and ChatGPT's Advanced Data Analysis implement PAL-like patterns as first-class features.
Key Takeaways
- Have the LLM write code to solve problems instead of computing directly
- Execute the code in a sandboxed environment and use the output as the answer
- The LLM handles language understanding; the interpreter handles computation
- Use few-shot examples showing the code-writing format
- Always sandbox code execution — never run LLM-generated code in production without safety controls