OpenClaw's LLM API costs are variable — they depend on your message volume, model choice, and how much context is passed per request. With the right configuration, you can control costs precisely without sacrificing quality where it matters.
## Understanding Where Tokens Are Spent
Every OpenClaw request sends:
- System prompt — Your SOUL.md content (constant per request)
- Memory context — Relevant snippets from your conversation history
- Integration context — Data from connected tools (if relevant)
- Your message — The actual input
- Response — The AI's reply (output tokens)
For a typical request:
- System prompt: 200–600 tokens
- Memory context: 200–1000 tokens
- Your message: 20–200 tokens
- Response: 100–500 tokens
Total: roughly 600–2,300 tokens per request. At GPT-4o pricing ($5/M input, $15/M output), that works out to roughly $0.005–0.02 per message.
At 50 messages/day: $7.50–30/month on GPT-4o. The same 50 messages/day on Gemini 2.0 Flash: roughly $0.10–0.45/month.
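To make the arithmetic concrete, here is a small Python sketch of the per-message estimate (the `message_cost` helper is illustrative, not part of OpenClaw):

```python
# Illustrative cost arithmetic only -- not an OpenClaw API.
def message_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 5.0,
                 output_price_per_m: float = 15.0) -> float:
    """USD cost of one request at the given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A mid-sized request at GPT-4o rates: ~1,200 input + ~300 output tokens
per_message = message_cost(1200, 300)   # ≈ $0.0105
per_month = per_message * 50 * 30       # 50 messages/day ≈ $15.75/month
```

Swapping in Gemini Flash rates (`message_cost(1200, 300, 0.075, 0.30)`) drops the same message to about $0.0002, which is where the two-orders-of-magnitude monthly difference comes from.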
## Monitoring Current Usage
```bash
# View token usage by day
openclaw stats --period 7d

# View usage by model
openclaw stats --by-model

# View top token-consuming conversations
openclaw stats --top-conversations
```
Set up a weekly usage report:
```yaml
scheduled_tasks:
  - name: "weekly-cost-report"
    cron: "0 9 * * 1"  # Monday 9am
    action: "internal"
    function: "generate_usage_report"
    format: "weekly"
    send_to: "whatsapp"  # or "telegram", "slack"
```
## Model Selection: The Biggest Lever
The model you use is the single largest cost driver, and the differences are dramatic:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost |
|---|---|---|---|
| GPT-4o | ~$5 | ~$15 | 1x (baseline) |
| Claude Sonnet | ~$3 | ~$15 | ~0.8x |
| GPT-4o-mini | ~$0.15 | ~$0.60 | ~0.04x |
| Gemini 2.0 Flash | ~$0.075 | ~$0.30 | ~0.02x |
| Claude Haiku | ~$0.25 | ~$1.25 | ~0.05x |
| Local (Ollama/LM Studio) | $0 | $0 | 0x |
Switching from GPT-4o to GPT-4o-mini for everyday messages reduces API costs by ~96% on those messages, with only modest quality reduction for routine tasks.
## Smart Routing by Task Type
```yaml
llm:
  routing:
    - pattern: "^(analyse|research|compare|review|explain in depth|write long)"
      model: "gpt-4o"  # Premium for demanding tasks
    - pattern: "^(remind|check|what time|quick|summarise briefly)"
      model: "gpt-4o-mini"  # Budget for quick tasks
    - default:
      model: "gemini-2.0-flash"  # Cheapest for everything else
```
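This routing is first-match-wins: patterns are tried in order and the default catches everything else. A hypothetical Python sketch of that logic (not OpenClaw's internals):

```python
import re

# Routes mirror the config above: first matching pattern wins.
ROUTES = [
    (r"^(analyse|research|compare|review|explain in depth|write long)", "gpt-4o"),
    (r"^(remind|check|what time|quick|summarise briefly)", "gpt-4o-mini"),
]
DEFAULT_MODEL = "gemini-2.0-flash"

def route(message: str) -> str:
    """Return the model for a message: first pattern match, else default."""
    for pattern, model in ROUTES:
        if re.match(pattern, message, re.IGNORECASE):
            return model
    return DEFAULT_MODEL
```

Anchoring patterns with `^` keeps routing cheap and predictable, but a demanding request phrased without a trigger word ("can you dig into...") falls through to the cheapest model, so review your routing stats occasionally.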
## Reducing Context Size
Every token of context costs money. Trimming what gets sent per request reduces costs without changing which model you use.
### SOUL.md Optimisation
Your SOUL.md is sent with every request. A 1000-token SOUL.md adds $0.005 to every GPT-4o call. At 50 messages/day: ~$7.50/month just for the SOUL.md.
Audit your SOUL.md for:
- Redundant instructions ("be helpful" — already the default)
- Instructions that only apply to rare situations (move to per-message)
- Verbose phrasing that can be shortened
A well-tuned SOUL.md of 200–400 tokens costs $1.50–3/month vs $7.50+ for a bloated 1000-token version.
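The overhead scales linearly with both prompt size and message volume, which a quick sketch makes obvious (hypothetical helper; GPT-4o input rate assumed):

```python
def soul_overhead_per_month(soul_tokens: int, messages_per_day: int = 50,
                            input_price_per_m: float = 5.0) -> float:
    # The system prompt is resent with every request, so its cost
    # scales with both its size and your message volume.
    return soul_tokens * messages_per_day * 30 * input_price_per_m / 1_000_000

bloated = soul_overhead_per_month(1000)  # $7.50/month
lean = soul_overhead_per_month(300)      # $2.25/month
```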
### Memory Context Limits
```yaml
memory:
  max_context_tokens: 500    # Limit memory snippets per request
  relevance_threshold: 0.7   # Only include highly relevant memories
  max_memories_per_request: 5
```
Reducing from unlimited memory context to a 500-token cap can cut 30–50% off token costs, with minimal practical impact on response quality.
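One plausible way the three limits combine, sketched in Python (assumes memories arrive as `(relevance, token_count, snippet)` tuples; not OpenClaw's actual implementation):

```python
def select_memories(scored, relevance_threshold=0.7,
                    max_memories=5, max_context_tokens=500):
    """Pick memory snippets: most relevant first, subject to all three caps."""
    picked, total = [], 0
    for relevance, tokens, text in sorted(scored, reverse=True):
        if relevance < relevance_threshold or len(picked) >= max_memories:
            break  # below threshold (sorted desc) or already at the cap
        if total + tokens > max_context_tokens:
            continue  # too big for the token budget; a smaller one may fit
        picked.append(text)
        total += tokens
    return picked
```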
### Context Window Trimming
For long conversations, OpenClaw maintains a rolling context window. Configure the window size:
```yaml
llm:
  context_window:
    max_conversation_turns: 10   # Only last 10 turns in context
    max_tokens: 4000             # Hard token limit on conversation history
    trimming_strategy: "sliding" # Keep the most recent turns
```
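A sliding-window trim like this can be sketched as follows (the 4-characters-per-token count is a rough stand-in for a real tokenizer):

```python
def trim_history(turns, max_turns=10, max_tokens=4000):
    """Keep the most recent turns that fit both the turn and token caps."""
    def count_tokens(text):
        return len(text) // 4  # rough estimate: ~4 characters per token

    kept, total = [], 0
    for turn in reversed(turns[-max_turns:]):   # newest first
        tokens = count_tokens(turn)
        if total + tokens > max_tokens:
            break  # adding an older turn would blow the budget
        kept.append(turn)
        total += tokens
    return list(reversed(kept))                 # restore chronological order
```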
## Budget Controls
### Per-Provider Monthly Limits
```yaml
providers:
  openai:
    api_key: "sk-..."
    monthly_token_budget: 1000000  # 1M tokens/month
    warn_at_percent: 80            # Alert at 80% usage
    action_at_limit: "fallback"    # Switch to fallback provider
    fallback_provider: "gemini"
  gemini:
    api_key: "AIza-..."
    monthly_token_budget: 5000000  # Generous Gemini budget
```
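The warn-then-fallback behaviour configured above can be sketched as (hypothetical class, not OpenClaw's internals):

```python
class ProviderBudget:
    """Warn at a usage threshold; hand off to a fallback provider at the cap."""

    def __init__(self, name, monthly_token_budget,
                 warn_at_percent=80, fallback_provider=None):
        self.name = name
        self.budget = monthly_token_budget
        self.warn_at = monthly_token_budget * warn_at_percent // 100
        self.fallback = fallback_provider
        self.used = 0

    def record(self, tokens):
        """Record usage; return (status, provider_to_use)."""
        self.used += tokens
        if self.used >= self.budget and self.fallback:
            return ("fallback", self.fallback)
        if self.used >= self.warn_at:
            return ("warn", self.name)
        return ("ok", self.name)
```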
### Also Set Limits at the Provider Level
OpenClaw's internal limits are a safety net, but always set spending limits directly in:
- OpenAI: Platform → Billing → Usage limits
- Anthropic: Console → Billing → Set limit
- Google: Cloud Console → Billing → Budgets & alerts
A dual-layer limit means even if OpenClaw's config has a bug, your API provider won't charge you beyond your set cap.
## Local Models: Zero API Cost
For users with adequate hardware, running local models via Ollama or LM Studio eliminates API costs entirely:
```yaml
providers:
  ollama:
    base_url: "http://localhost:11434/v1"
    api_key: "ollama"
    default_model: "llama3.1:8b-instruct-q4_K_M"
```
Use a hybrid approach — local for routine tasks, cloud LLM for complex ones:
```yaml
llm:
  routing:
    - pattern: "^(analyse|research|code review|draft)"
      provider: "anthropic"
      model: "claude-sonnet-4-5"
    - default:
      provider: "ollama"
      model: "llama3.1:8b-instruct-q4_K_M"
```
This gives you zero-cost everyday operation with premium quality on demand.
## Caching Repeated Requests
For automations that repeatedly ask similar questions (daily briefings, status checks), enable response caching:
```yaml
llm:
  cache:
    enabled: true
    ttl: 3600             # Cache responses for 1 hour
    max_size: 100         # Store up to 100 cached responses
    cache_threshold: 0.95 # Only cache near-identical requests
```
A cached response costs zero tokens. For scheduled tasks that run hourly, this can reduce costs by 80%+ on those tasks.
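A minimal sketch of TTL-based response caching (exact-match hashing stands in for the `cache_threshold` similarity check; not OpenClaw's actual cache):

```python
import hashlib
import time

class ResponseCache:
    """TTL cache for responses to near-identical scheduled prompts."""

    def __init__(self, ttl=3600, max_size=100):
        self.ttl = ttl
        self.max_size = max_size
        self.store = {}  # key -> (timestamp, response)

    def _key(self, prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(self._key(prompt))
        if entry and now - entry[0] < self.ttl:
            return entry[1]  # cache hit: zero tokens spent
        return None

    def put(self, prompt, response, now=None):
        now = time.time() if now is None else now
        if len(self.store) >= self.max_size:
            oldest = min(self.store, key=lambda k: self.store[k][0])
            del self.store[oldest]  # evict the oldest entry
        self.store[self._key(prompt)] = (now, response)
```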
## Monthly Cost Targets by Profile
| User Type | Daily Messages | Recommended Setup | Monthly Cost |
|---|---|---|---|
| Light personal | 10–20 | Gemini Flash default | <$1 |
| Active personal | 30–50 | Gemini Flash + GPT-4o for complex | $3–10 |
| Heavy personal | 100+ | GPT-4o-mini default, GPT-4o for complex | $10–25 |
| Local-first | Any | Ollama default, cloud fallback | $0–5 |
| Developer/power | 200+ | Claude Sonnet default | $20–50 |
Related reading: