Claude's computer use capability — where the model can see your screen, move a cursor, type, and click — is one of those features that sounds like magic until you actually try to use it. Then you run into walls fast: the model clicks the wrong button, gets confused by a dropdown it can't read, or loops endlessly trying to find an element that moved.
The computer use API is genuinely useful, but it requires a different prompting mindset than ordinary chat. This guide covers what actually works.
What computer use is (and what it isn't)
Claude's computer use beta gives the model a computer tool that can take screenshots (to see the current screen state), move the mouse, click, type, and press keys, plus optional text editor and bash tools for editing files and running shell commands. You define a task; Claude takes a screenshot to understand the current state, decides what action to take, acts, screenshots again, and repeats until it's done or stuck.
This is agentic prompting: each step changes the environment, which changes what the model sees next. And unlike a chat reply you can simply regenerate, computer use actions are real. If Claude clicks "Delete" on a file, that file is gone.
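That observe-decide-act loop can be sketched in a few lines. This is a minimal skeleton, not the real API: `take_screenshot`, `decide_action`, and `perform` are stand-ins for the actual API round trips, and the step budget is an assumption you'd tune for your own tasks.

```python
# Minimal sketch of the computer-use agent loop: observe, decide, act,
# repeat until the model reports done or a step budget runs out.
# take_screenshot / decide_action / perform are stand-ins for real API calls.
def run_agent(task, decide_action, take_screenshot, perform, max_steps=20):
    history = []
    for step in range(max_steps):
        screen = take_screenshot()                      # observe current state
        action = decide_action(task, screen, history)   # model picks next action
        if action["type"] == "done":
            return {"status": "done", "steps": step}
        perform(action)                                 # act on the environment
        history.append(action)
    return {"status": "step_budget_exhausted", "steps": max_steps}
```

The step budget matters: without it, a confused agent will loop forever, which is exactly the failure mode discussed under error recovery below.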
The beta API works best for:
- Repetitive desktop tasks that follow consistent UI patterns
- Form-filling workflows across apps that don't have APIs
- Browser automation where the page structure changes and CSS selectors break
- Testing UI flows exactly as a user would see them
It struggles with:
- Captchas and login flows with 2FA
- Rapidly changing UIs (animations, hover states)
- Tasks requiring precise pixel-level targeting
- Any workflow that requires real-time response (it's not fast)
The core prompting pattern
The most reliable pattern is: be specific about the goal, name the application, describe the starting state, and define what "done" looks like.
Vague:
Fill out the expense report.
Better:
Open the Concur expense report app (already open in Chrome at concur.company.com).
I need to submit a new expense report for the March 4 team dinner.
The receipt is at ~/Downloads/dinner-receipt.pdf.
Amount: $342.50, Category: Team Meals, Project code: PROJ-2024-Q1.
The report is done when you see the confirmation screen with a submission number.
Specificity matters because Claude can't pause mid-task to ask clarifying questions the way it can in chat. It has to make decisions on its own, and when it's uncertain it guesses; guesses in UI automation are often wrong.
Grounding with screenshots
Don't assume Claude knows what state the screen is in. Start every session with an explicit screenshot request:
First, take a screenshot so you can see the current state of the screen.
Then proceed to [task].
Better still, describe what Claude should see before it starts:
Chrome is open with Gmail. The inbox is visible. I want you to:
1. Search for emails from sarah@company.com in the last 7 days
2. Star each one that mentions "Q1 budget"
3. Report back how many you starred
When Claude knows what it should be looking at, it can immediately confirm or flag if something is wrong.
Task decomposition for complex workflows
Long workflows fail more than short ones. A task with 15 steps has more ways to go wrong than one with 3 steps. Break complex automation into checkpointed chunks.
Instead of:
Download the sales data from our CRM, clean it in Excel, and upload it to the Google Sheet.
Run three separate calls:
# Call 1
Open Salesforce at [URL]. Export the Q1 opportunities report as CSV to ~/Downloads/.
Done when the file appears in ~/Downloads/.
# Call 2 (verify file exists first)
Open ~/Downloads/opportunities-q1.csv in Excel.
Remove rows where Stage is "Prospecting".
Save as ~/Downloads/opportunities-q1-cleaned.csv.
# Call 3
Open Google Sheets at [URL].
Import ~/Downloads/opportunities-q1-cleaned.csv into the "Q1 Data" sheet, replacing existing content.
Each call is independently verifiable. If step 2 fails, you haven't wasted the work from step 1.
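The checkpointing above can be scripted. In this sketch, `run_checkpointed` is a hypothetical driver, `run_call` is a stand-in for whatever sends a prompt to the model, and "an expected file exists" is the assumed checkpoint between calls:

```python
from pathlib import Path

# Sketch: run each sub-task as its own call, and verify a checkpoint
# (here, that an expected file exists) before moving to the next one.
# run_call is a stand-in for sending the prompt to the model.
def run_checkpointed(steps, run_call):
    for prompt, checkpoint_file in steps:
        run_call(prompt)
        if not Path(checkpoint_file).exists():
            # Stop at the first failed checkpoint instead of compounding errors
            return {"ok": False, "failed_at": prompt}
    return {"ok": True, "failed_at": None}
```

Because each checkpoint is a plain filesystem check, a failed step leaves you with an exact resume point rather than a half-finished, ambiguous state.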
Handling UI ambiguity
Claude sees the screen as an image. It uses visual understanding to identify buttons, fields, and text — but it can get confused by:
Similar-looking elements: "Click the Save button" when there's both a "Save" and "Save as Draft" button. Be explicit: "Click the blue 'Save' button in the top-right corner, not 'Save as Draft'."
Dynamic content: Dropdowns, modals, and tooltips that appear on hover. Tell Claude to hover first and wait: "Hover over the 'Actions' menu in the top toolbar. Wait for the dropdown to appear, then click 'Export'."
Scroll position: Claude might not know content is below the fold. "Scroll down on the left sidebar until you see 'Advanced Settings', then click it."
State changes: After clicking, the UI might take a moment to update. Tell Claude to wait and re-screenshot: "Click 'Generate Report'. Wait for the progress bar to disappear (it may take 10-15 seconds), then screenshot to confirm the report appeared."
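The "wait, then re-screenshot" advice generalizes to a small polling helper. This is a sketch under assumptions: `take_screenshot` stands in for the real screenshot call, and `condition` is whatever check you run against the captured screen (for example, "the progress bar is gone").

```python
import time

# Sketch: poll the screen until a condition holds or a timeout expires,
# mirroring "wait for the progress bar to disappear, then screenshot".
# condition receives the latest screenshot and returns True when satisfied.
def wait_for(condition, take_screenshot, timeout=15.0, interval=0.5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(take_screenshot()):
            return True
        time.sleep(interval)
    return False  # caller decides whether a timeout is an error
```

Returning `False` on timeout, rather than raising, keeps the decision about what a timeout means (retry, abort, report) with the caller.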
Error recovery prompting
Agentic tasks fail. Build recovery into your prompts.
Your goal is to export the monthly report from the dashboard.
If you encounter a permissions error, stop and describe exactly what you saw.
If the Export button is grayed out or disabled, take a screenshot and report what you see — do not try to proceed.
If the page takes more than 30 seconds to load, stop and report.
Do not attempt more than 3 times if an action doesn't produce the expected result.
Without explicit stopping conditions, Claude will sometimes retry actions in a loop, clicking the same broken button repeatedly or trying different approaches that all fail for the same underlying reason.
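The "do not attempt more than 3 times" rule can also be enforced outside the prompt, in the harness around the model. A minimal sketch, assuming `attempt` performs the action and `check` verifies the expected result appeared:

```python
# Sketch: cap retries on a single action and surface a structured failure
# instead of looping. attempt() performs the action (e.g. a click);
# check() verifies the expected result appeared on screen.
def act_with_cap(attempt, check, max_attempts=3):
    for n in range(1, max_attempts + 1):
        attempt()
        if check():
            return {"ok": True, "attempts": n}
    return {"ok": False, "attempts": max_attempts,
            "reason": "no expected result after retries"}
```

Belt-and-suspenders works best here: state the cap in the prompt so the model stops gracefully, and enforce it in code so a runaway loop can't burn tokens even if the model ignores the instruction.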
The system prompt for reliability
For production computer use, a strong system prompt matters more than the individual task prompts. Here's a pattern that works:
You are an automation assistant with access to a desktop computer.
Core behaviors:
- Take a screenshot before starting any task to confirm the current screen state
- After each significant action, take a screenshot to verify the result
- If something looks unexpected, stop and describe what you see before continuing
- Never click "Delete", "Remove", "Unsubscribe", or any destructive action unless explicitly instructed
- If you encounter a login screen, stop and report — do not attempt to enter credentials
- Prefer keyboard shortcuts over mouse clicks when both are available
- When typing in a field, click the field first to confirm focus before typing
Uncertainty protocol:
- If you're 90%+ confident in an action, proceed
- If you're less certain, describe what you see and what you're considering doing, then ask
- If you're stuck after 2 attempts at the same step, stop and report what happened
Output format:
- Before each action: briefly describe what you're about to do
- After completing the task: summarize what you did and confirm the end state
Practical example: web form automation
Here's a full example that works reliably for filling a web form:
System: [Use the system prompt above]
User: Fill out the vendor onboarding form at [URL] with this information:
- Company name: Acme Corp
- Contact email: billing@acme.com
- Primary use case: Software licensing
- Monthly volume: $5,000-$10,000
- Tax ID: 12-3456789
Before submitting, take a screenshot of the filled form so I can review it.
Do NOT click Submit until I confirm.
The "show me before submitting" instruction is important. It adds a human checkpoint before irreversible action — a pattern you should use for anything that emails someone, charges money, or deletes data.
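That human checkpoint can live in the harness too, so irreversible actions require an explicit yes even if the prompt instruction gets lost. A minimal sketch; `confirm_gate` and its `ask` parameter are hypothetical names, with `ask` defaulting to console input:

```python
# Sketch: a human-confirmation gate in front of irreversible actions
# (submitting, emailing, charging, deleting). Anything other than an
# explicit "y" is treated as a refusal.
def confirm_gate(description, ask=input):
    answer = ask(f"About to: {description}. Proceed? [y/N] ")
    return answer.strip().lower() == "y"
```

Defaulting to "no" is the important design choice: an empty or mistyped answer blocks the action rather than letting it through.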
Bash vs computer for automation
When available, prefer bash over mouse clicks for tasks that can be done programmatically. Bash is faster, more reliable, and not dependent on UI state.
Computer use is ideal for tasks that genuinely require the visual interface — forms that aren't accessible via API, applications without CLI access, tasks where you need to visually confirm what you're doing.
But if the task can be done with a curl request, a Python script, or a CLI command, prompt Claude to use bash instead. The output is deterministic and doesn't depend on what's rendering on screen.
Option A (computer): Navigate to the downloads page, find the CSV, click download...
Option B (bash): curl -H "Authorization: Bearer $TOKEN" https://api.service.com/export/csv -o data.csv
Option B will succeed or fail clearly. Option A might work, or might click the wrong link, or might get blocked by a cookie banner.
What to build now
Computer use is in beta for a reason — it's capable but not production-ready for high-stakes autonomous workflows. Where it shines today:
- Internal tooling where you control the environment and can test thoroughly
- Research workflows that are tedious but not critical (scraping structured data from pages, filling out forms you'd otherwise do manually)
- Testing your own applications from a user's perspective
For anything customer-facing or involving sensitive data, keep a human in the loop. Use it as a powerful assistant, not a fully autonomous agent.
The capability is evolving fast. The prompting patterns that work now — specific goals, checkpointed tasks, explicit stopping conditions, human confirmation before irreversible actions — will keep working as the model improves.
If you're building agentic workflows, the Agents track covers the underlying patterns. And for other automation approaches that don't require a visual interface, the function calling lesson is worth reading alongside this.