LLMs are remarkable reasoners but unreliable calculators. Program-Aided Language Models (PAL) solve this by having the model write code instead of computing answers directly, letting the interpreter handle what it does best.
The Problem with LLM Arithmetic
Ask an LLM to solve a multi-step math word problem and it'll often get the reasoning right but the arithmetic wrong:
Q: A train leaves at 9:15 AM and arrives 3 hours 47 minutes later.
The ticket costs $2.50 per stop, and there are 8 stops.
The passenger gets a 15% discount. What is the final price?
LLM answer: "The train arrives at 1:02 PM. The final price is $17.50."
(Correct arrival time, wrong price — arithmetic error in the discount calculation)
The model understands the problem perfectly but makes calculation errors because it's generating tokens, not computing.
The PAL Solution
Instead of computing the answer, write Python that computes it:
Q: A train leaves at 9:15 AM and arrives 3 hours 47 minutes later.
The ticket costs $2.50 per stop, and there are 8 stops.
The passenger gets a 15% discount. What is the final price?
Write Python code to solve this step by step, then execute it.
# Solution
from datetime import datetime, timedelta
# Arrival time
departure = datetime.strptime("9:15 AM", "%I:%M %p")
travel_time = timedelta(hours=3, minutes=47)
arrival = departure + travel_time
print(f"Arrives at: {arrival.strftime('%I:%M %p')}") # 01:02 PM
# Ticket price
price_per_stop = 2.50
stops = 8
base_price = price_per_stop * stops # 20.00
discount = 0.15
final_price = base_price * (1 - discount) # 17.00
print(f"Final price: ${final_price:.2f}") # $17.00
Execution output: Arrives at: 01:02 PM / Final price: $17.00
The code is correct. The interpreter produces the exact answer.
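To make the round trip concrete, here is a minimal sketch of the execution step: run the generated code and capture whatever it prints as the answer. `execute_pal_code` is a hypothetical helper name, and the bare `exec` is for illustration only — real deployments need the sandboxing covered later in this article.

```python
import io
from contextlib import redirect_stdout

def execute_pal_code(code: str) -> str:
    """Run model-generated code and capture its printed output.
    Illustration only -- production use requires a sandbox."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

# The price portion of the generated solution above
generated = """
price_per_stop = 2.50
stops = 8
base_price = price_per_stop * stops
final_price = base_price * (1 - 0.15)
print(f"Final price: ${final_price:.2f}")
"""
print(execute_pal_code(generated))  # Final price: $17.00
```

The printed output becomes the model's answer, so the LLM never has to do the multiplication itself.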
The PAL Prompt Pattern
A PAL prompt typically uses few-shot examples that show the model how to produce code instead of direct answers:
Solve the following math problems by writing Python code. Use comments to explain
your reasoning. Then I'll run the code and give you the output.
Example 1:
Problem: If a store sells 240 items at $12.50 each but gives a 20% discount
to customers who buy more than 10 items, and John bought 15 items, what does
John pay?
# John bought more than 10, so gets the discount
price_per_item = 12.50
items = 15
discount = 0.20
total = items * price_per_item * (1 - discount)
print(f"John pays: ${total:.2f}")
# Output: John pays: $150.00
---
Now solve:
Problem: [Your problem here]
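Wiring this pattern into an application takes two small pieces: assembling the few-shot prompt around a new problem, and pulling the code out of the model's reply. The helper names below (`build_pal_prompt`, `extract_code`) are hypothetical, and the regex assumes the model may wrap its code in a markdown fence:

```python
import re

# Condensed few-shot prefix; in practice include the full worked example(s)
PAL_PREFIX = (
    "Solve the following math problems by writing Python code. "
    "Use comments to explain your reasoning.\n\n"
)

def build_pal_prompt(problem: str) -> str:
    """Append the new problem to the few-shot prefix."""
    return f"{PAL_PREFIX}Problem: {problem}\n"

def extract_code(reply: str) -> str:
    """Pull code out of the model reply, stripping a markdown fence
    if the model added one; otherwise return the reply as-is."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

reply = "```python\ntotal = 15 * 12.50 * 0.8\nprint(total)\n```"
print(extract_code(reply))
```

The extracted string is what gets handed to the (sandboxed) interpreter in the next step.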
PAL for Non-Math Problems
PAL extends beyond arithmetic to any task that benefits from programmatic exactness:
Date and time calculations:
from datetime import date, timedelta
start = date(2026, 3, 1)
days_until_quarter_end = (date(2026, 6, 30) - start).days
print(f"Days until the end of Q2: {days_until_quarter_end}") # 121
String manipulation:
# Count vowels in a sentence
sentence = "The quick brown fox jumps over the lazy dog"
vowels = sum(1 for c in sentence.lower() if c in 'aeiou')
print(f"Vowel count: {vowels}") # 11
Logical operations:
# Which candidates meet all criteria?
candidates = [
{"name": "Alice", "years_exp": 5, "python": True, "remote": True},
{"name": "Bob", "years_exp": 3, "python": True, "remote": False},
{"name": "Carol", "years_exp": 7, "python": False, "remote": True},
]
qualified = [c["name"] for c in candidates
             if c["years_exp"] >= 4 and c["python"] and c["remote"]]
print(qualified) # ['Alice']
Accuracy Comparison
Research from Gao et al. (2022) compared PAL against standard CoT on math benchmarks:
| Benchmark | Standard CoT | PAL |
|---|---|---|
| GSM8K (grade school math) | 56.4% | 72.0% |
| MATH (competition math) | 6.5% | 12.0% |
| AQuA (algebra) | 47.4% | 55.5% |
PAL consistently outperforms pure language generation on tasks involving computation.
Implementing PAL Safely
The critical concern with PAL is code execution safety. Never run LLM-generated code without sandboxing:
import subprocess
import tempfile
import os
def run_pal(code: str, timeout: int = 5) -> str:
    """Execute PAL-generated code in a restricted subprocess."""
    # Write the generated code to a temp file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(code)
        tmp_path = f.name
    try:
        result = subprocess.run(
            ["python3", tmp_path],
            capture_output=True,
            text=True,
            timeout=timeout,
            # Restrict the environment the code sees
            env={"PATH": "/usr/bin", "HOME": "/tmp"},
        )
        return result.stdout if result.returncode == 0 else f"Error: {result.stderr}"
    except subprocess.TimeoutExpired:
        return f"Error: execution exceeded {timeout}s timeout"
    finally:
        os.unlink(tmp_path)
Minimum safety requirements:
- Sandboxed subprocess (no network, no filesystem access beyond /tmp)
- Hard timeout (prevent infinite loops)
- Memory limits
- Allowlist of safe imports (block os, subprocess, requests, etc.)
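The import allowlist can be enforced with a static AST check before the code ever runs. This is a sketch with an illustrative allowlist; it complements, but does not replace, the subprocess sandbox, since dangerous behavior is possible without any import at all:

```python
import ast

# Illustrative allowlist -- tune to your use case
ALLOWED_MODULES = {"math", "datetime", "itertools", "json", "re", "collections"}

def disallowed_imports(code: str) -> list:
    """Return top-level modules imported by `code` that are not on
    the allowlist. Static check, run before execution."""
    bad = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        bad.extend(r for r in roots if r not in ALLOWED_MODULES)
    return bad

print(disallowed_imports("import os, math\nfrom subprocess import run"))
# ['os', 'subprocess']
```

Reject the code (or return an error to the model) whenever this list is non-empty, before the subprocess is ever spawned.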
PAL vs. Tool Calling
| Approach | How it works | Best for |
|---|---|---|
| PAL | LLM writes full programs | Complex multi-step computation |
| Calculator tool | LLM calls a pre-built calculator | Simple arithmetic |
| Code interpreter | LLM writes code, sandbox runs it | General computation, data analysis |
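To see the contrast concretely, a calculator tool exposes one narrow, safely-evaluated function per call, whereas PAL hands the model a whole interpreter. A minimal sketch of the calculator side, using an AST walker rather than `eval`:

```python
import ast
import operator

# Calculator-style tool: the LLM emits one arithmetic expression
# per call, and the host evaluates it with a restricted parser.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Safely evaluate a single arithmetic expression."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

print(calc("2.50 * 8 * (1 - 0.15)"))  # 17.0
```

One call, one number back. Multi-step problems then require the model to chain many tool calls and carry intermediate results in its context, which is exactly the bookkeeping a full PAL program avoids.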
Modern AI systems like Claude's code interpreter and ChatGPT's Advanced Data Analysis implement PAL-like patterns as first-class features.
Key Takeaways
- Have the LLM write code to solve problems instead of computing directly
- Execute the code in a sandboxed environment and use the output as the answer
- The LLM handles language understanding; the interpreter handles computation
- Use few-shot examples showing the code-writing format
- Always sandbox code execution — never run LLM-generated code in production without safety controls