Computer use has improved a lot since the 2024 launch. The OCR is better, screenshot analysis is faster, and element targeting is more reliable. It's still not the right tool for most tasks — if there's an API, use the API. But for specific use cases, it's surprisingly capable.
This post covers five workflows that actually work reliably in 2026, along with honest numbers on cost, reliability, and when to pick Stagehand or the direct API instead.
What computer use is (and isn't)
The architecture: a Docker container with a virtual desktop → your agent loop captures a screenshot → sends it to the Claude API → receives mouse/keyboard actions → executes them on the screen.
This means:
- High latency: 2–5 seconds per action (screenshot upload + API call + action execution)
- High cost: ~$0.10–0.30 per screenshot analysis. A 20-step workflow costs $2–6.
- No integration with the target system: Claude literally sees what a human sees and interacts the same way
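As a rough sanity check, the per-action figures above can be turned into a back-of-envelope estimate. The sketch below is illustrative only: `estimate_workflow` is a hypothetical helper, and the constants are just the ranges quoted in the list, not measured values.

```python
# Ranges quoted in the list above; replace them with your own measurements.
SECONDS_PER_ACTION = (2.0, 5.0)
DOLLARS_PER_SCREENSHOT = (0.10, 0.30)


def estimate_workflow(steps: int) -> dict:
    """Return (low, high) estimates for total latency and cost of a workflow."""
    return {
        "seconds": tuple(round(steps * s, 1) for s in SECONDS_PER_ACTION),
        "dollars": tuple(round(steps * d, 2) for d in DOLLARS_PER_SCREENSHOT),
    }


estimate = estimate_workflow(20)
print(estimate["dollars"])  # (2.0, 6.0)
print(estimate["seconds"])  # (40.0, 100.0)
```

A 20-step workflow comes out at $2–6, matching the figure above, and somewhere between 40 and 100 seconds of wall-clock time.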
When it's the right choice: legacy systems with no API, government portals, vendor software you can't integrate with, enterprise systems that predate modern APIs.
When it's the wrong choice: anything with an API. Web scraping that doesn't require interaction. Forms on modern sites (use Stagehand instead — it's cheaper and more reliable).
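The rules of thumb above can be written down as a small decision function. This is only a sketch of the reasoning in this section: `pick_tool` and its three flags are hypothetical names, and real decisions usually involve more than three booleans.

```python
def pick_tool(has_api: bool, needs_interaction: bool, modern_web_form: bool) -> str:
    """Encode the right-choice / wrong-choice rules as a single function."""
    if has_api:
        return "direct API"  # Anything with an API: use the API
    if not needs_interaction:
        return "plain scraper"  # Read-only pages do not need an agent
    if modern_web_form:
        return "Stagehand"  # Cheaper and more reliable on modern sites
    return "computer use"  # Legacy systems, government portals, vendor software


print(pick_tool(has_api=False, needs_interaction=True, modern_web_form=False))
# computer use
```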
Setup
# Pull the official computer use Docker image
docker pull ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
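# Run with VNC access
# Ports: 5900 = VNC, 8080 = combined web interface.
# In the quickstart image the page on 8080 embeds the Streamlit chat (8501)
# and the noVNC desktop view (6080), so those are mapped too.
docker run -it \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -v "$HOME/.anthropic:/home/computeruse/.anthropic" \
  -p 5900:5900 \
  -p 8501:8501 \
  -p 6080:6080 \
  -p 8080:8080 \
  ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest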
Open http://localhost:8080 to see the virtual desktop and interact with it.
For production, run this on a VPS (you need at least 2GB RAM for the container):
# On Hostinger VPS or any cloud server
docker run -d \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -p 5900:5900 \
  -p 8080:8080 \
  ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
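Before pointing an agent at a freshly started container, it helps to confirm that it is actually accepting connections. A minimal sketch, assuming the port mapping shown above; `wait_for_port` is a hypothetical helper, not part of the quickstart image.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(1.0)  # Not up yet; retry shortly
    return False
```

Calling `wait_for_port("localhost", 8080)` after `docker run -d` returns `True` once the web interface is reachable, or `False` if nothing answers within a minute. It only checks that the port is open, not that the desktop inside has finished booting.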
The computer use API
import base64
import subprocess

import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment


def take_screenshot() -> str:
    """Capture current desktop screenshot as base64.

    In the Docker container, use the provided screenshot tool."""
    # scrot -o overwrites the previous capture; check=True surfaces failures
    subprocess.run(
        ["scrot", "-o", "/tmp/screenshot.png"],
        capture_output=True,
        check=True,
    )
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()
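def run_computer_use_agent(task: str, max_steps: int = 25) -> str:
    """Run the screenshot → action loop until Claude stops or max_steps is hit."""

    def screenshot_block() -> dict:
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": take_screenshot(),
            },
        }

    # Start with the task plus the current state of the screen
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": task}, screenshot_block()],
    }]

    for step in range(max_steps):
        response = client.beta.messages.create(
            model="claude-opus-4-7",  # Computer use requires Opus for reliability
            max_tokens=1024,
            tools=[{
                "type": "computer_20250124",
                "name": "computer",
                "display_width_px": 1920,
                "display_height_px": 1080,
                "display_number": 1,
            }],
            # Beta flag matching the tool version above; check the docs for
            # the tool version your model expects
            betas=["computer-use-2025-01-24"],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            # No further actions requested: return the final text, if any
            text = [b.text for b in response.content if b.type == "text"]
            return "\n".join(text) or "Task complete"

        # Execute each action, then answer every tool_use with a tool_result
        # carrying a fresh screenshot; the API rejects the next request otherwise
        results = []
        for block in tool_uses:
            if block.name == "computer":
                execute_computer_action(block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [screenshot_block()],
            })
        messages.append({"role": "user", "content": results})

    return "Reached step limit"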
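def execute_computer_action(action: dict):
    """Execute a mouse/keyboard action on the virtual desktop."""
    import pyautogui

    action_type = action.get("action")
    if action_type == "screenshot":
        pass  # The agent loop captures a fresh screenshot on every iteration
    elif action_type == "mouse_move":
        pyautogui.moveTo(*action["coordinate"])
    elif action_type == "left_click":
        pyautogui.click(*action["coordinate"])
    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.05)
    elif action_type == "key":
        # The key (or combo such as "ctrl+s") arrives in the "text" field.
        # Names follow xdotool conventions; some may need mapping to pyautogui's.
        pyautogui.hotkey(*action["text"].lower().split("+"))
    elif action_type == "scroll":
        # computer_20250124 sends scroll_direction and scroll_amount
        clicks = action.get("scroll_amount", 3)
        if action.get("scroll_direction", "down") == "down":
            clicks = -clicks  # pyautogui scrolls up for positive values
        pyautogui.scroll(clicks)
    else:
        # Other actions (double_click, drag, ...) are not handled in this example
        print(f"Unsupported action: {action_type}")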
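Since the model returns raw pixel coordinates, one cheap safeguard is to reject actions that point outside the declared display before handing them to pyautogui. A sketch under that assumption: `validate_action` is a hypothetical helper, and the 1920x1080 defaults simply mirror the tool definition above.

```python
def validate_action(action: dict, width: int = 1920, height: int = 1080) -> None:
    """Raise ValueError if a coordinate-based action points outside the display."""
    coordinate = action.get("coordinate")
    if coordinate is None:
        return  # Actions such as "type" or "key" carry no coordinate
    if len(coordinate) != 2:
        raise ValueError(f"Expected [x, y], got {coordinate!r}")
    x, y = coordinate
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"Coordinate ({x}, {y}) is outside {width}x{height}")


validate_action({"action": "left_click", "coordinate": [640, 360]})  # passes silently
```

Calling this at the top of `execute_computer_action` turns an off-screen click into an explicit error rather than a silent miss.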
Workflow 1: Automated form filling for government portals
The use case that no API covers: MCA21 filings, GST portal operations, state government portals, tender submission portals. All have UIs, none have public APIs.
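import os

# Portal credentials, read from the environment rather than hard-coded
# (the variable names are an assumption; use whatever your secrets setup provides)
MCA_USERNAME = os.environ["MCA_USERNAME"]
MCA_PASSWORD = os.environ["MCA_PASSWORD"]

# Prepare the data you want to submit
form_data = {
    "company_name": "MasterPrompting Technologies Pvt Ltd",
    "cin": "U72900KA2024PTC123456",
    "director_din": "12345678",
    "filing_type": "MGT-7",
    "financial_year": "2025-26",
}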
task = f"""
Open Firefox and navigate to mca.gov.in/mcafoportal/login.do
Log in with credentials:
- Username: {MCA_USERNAME}
- Password: {MCA_PASSWORD}
After login, navigate to e-Filing → Annual Return (MGT-7).
Fill in the form with this data:
{form_data}
Take a screenshot before submitting.
DO NOT click the final Submit button — stop at the review page and describe what you see.
"""
result = run_computer_use_agent(task)
print(result) # Description of the review page for human to approve
Reliability: ~85% on simple single-page forms, ~60% on multi-page wizard forms with dynamic validation. Always stop before final submission and require human approval.
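The human-approval step can be as simple as a blocking prompt between the agent run and whatever performs the final submission. A minimal sketch: `require_approval` is a hypothetical helper, and the `ask` parameter exists only so the prompt can be swapped out in tests or replaced with a Slack or email approval flow.

```python
from typing import Callable


def require_approval(summary: str, ask: Callable[[str], str] = input) -> bool:
    """Show the agent's review-page summary and return True only on an explicit 'yes'."""
    print("Agent stopped at the review page:\n")
    print(summary)
    answer = ask("\nType 'yes' to approve submission: ")
    # Anything other than an explicit yes counts as a rejection
    return answer.strip().lower() == "yes"
```

In the workflow above this would sit right after `run_computer_use_agent(task)`: only if `require_approval(result)` returns `True` does a person (or a second, tightly scoped agent run) click Submit.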
Workflow 2: UI regression testing
Claude as a visual QA checker — not for unit tests, but for "does this look right?" validation:
checklist = """
Navigate to https://staging.yourapp.com/login
Check and report on each item:
1. Is the logo visible in the top-left corner?
2. Are there two input fields labeled "Email" and "Password"?
3. Is the "Sign In" button visible and enabled?
4. Is there a "Forgot password?" link?
5. Does the page have a dark mode toggle?
6. Are there any visible error messages or broken images?
7. Does clicking "Sign In" with empty fields show validation messages?
For each check: PASS, FAIL, or PARTIAL. Describe any failures specifically.
"""
result = run_computer_use_agent(checklist)
print(result)
# → "1. PASS - Logo visible top-left
# 2. PASS - Two fields present
# 3. PASS
# 4. PASS
# 5. FAIL - No dark mode toggle visible on this page
# 6. PASS - No errors
# 7. PASS - 'Email is required' validation shown"
This is more reliable than Selenium for visual checks because it doesn't depend on element IDs or CSS classes — it actually looks at the rendered page.
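To use a report like the one above in CI, the free-text result needs to become something a script can fail on. A rough sketch, assuming Claude sticks to the "N. STATUS - detail" shape requested in the checklist (worth verifying on your own runs); `parse_report` and `failed_checks` are hypothetical helpers.

```python
import re

# Matches lines such as "1. PASS - Logo visible" or "# 3. PASS"
CHECK_LINE = re.compile(r"(\d+)\.\s*(PASS|FAIL|PARTIAL)\b(?:\s*-\s*(.*))?")


def parse_report(report: str) -> dict[int, tuple[str, str]]:
    """Map each check number to (status, detail)."""
    results = {}
    for line in report.splitlines():
        match = CHECK_LINE.search(line)
        if match:
            number, status, detail = match.groups()
            results[int(number)] = (status, (detail or "").strip())
    return results


def failed_checks(report: str) -> list[int]:
    """Return the numbers of checks that did not pass outright."""
    return [n for n, (status, _) in parse_report(report).items() if status != "PASS"]


sample = """1. PASS - Logo visible top-left
3. PASS
5. FAIL - No dark mode toggle visible on this page"""
print(failed_checks(sample))  # [5]
```

A CI job can then exit non-zero whenever `failed_checks(result)` is non-empty. Checks that are missing from the report entirely are not flagged here, so comparing the parsed keys against the expected check numbers is a sensible extra step.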
Workflow 3: Legacy ERP data extraction
An old ERP with no export API, only a web UI. Extract purchase orders, 20 per run, by navigating the interface:
task = """
Open Chrome and navigate to http://erp.internal.company.com/login
Log in with the provided credentials.
Navigate to: Procurement → Purchase Orders → All POs
For each purchase order in the list (up to 20):
1. Click on the PO number to open it
2. Extract: PO number, vendor name, total amount, status, date created
3. Go back to the list
4. Move to the next PO
Format the extracted data as a JSON array.
Stop when you've processed 20 POs or reach the end of the list.
"""
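# 20 POs at roughly four actions each needs about 80 steps, so leave headroom
result = run_computer_use_agent(task, max_steps=100)

# Parse the JSON from result
import json
import re

json_match = re.search(r'\[.*\]', result, re.DOTALL)
po_data = json.loads(json_match.group()) if json_match else []
Each PO takes about four actions (open, read, go back, move on). At 2–5 seconds per action that is 8–20 seconds per PO, or roughly 3–7 minutes for 20 POs. Budget $5–20 for the extraction, in line with the cost table below. Slow, but it works on systems you can't otherwise touch.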
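Because the data comes from reading pixels, it is worth checking each record before loading it anywhere. A small sketch: the field names in `REQUIRED_FIELDS` are an assumption about how Claude will key the JSON (the prompt above does not pin them down), and `split_valid_records` is a hypothetical helper.

```python
# Assumed key names; pin these down in the prompt if you rely on them.
REQUIRED_FIELDS = {"po_number", "vendor_name", "total_amount", "status", "date_created"}


def split_valid_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records that have every required field from those that do not."""
    valid, invalid = [], []
    for record in records:
        missing = REQUIRED_FIELDS - set(record)
        (invalid if missing else valid).append(record)
    return valid, invalid


records = [
    {"po_number": "PO-1", "vendor_name": "Acme", "total_amount": 120.5,
     "status": "Open", "date_created": "2026-01-10"},
    {"po_number": "PO-2", "vendor_name": "Globex"},  # Incomplete extraction
]
valid, invalid = split_valid_records(records)
print(len(valid), len(invalid))  # 1 1
```

Incomplete records can be re-queued for a second pass rather than silently loaded with gaps.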
Workflow 4: Cross-browser visual QA
import subprocess
import time

browsers = ["google-chrome", "firefox"]
results = {}

for browser in browsers:
    # Launch the browser and give the page a few seconds to load
    subprocess.Popen([browser, "https://yourapp.com"])
    time.sleep(3)

    task = f"""
    The {browser} browser should now be open with https://yourapp.com loaded.
    Take a screenshot and describe:
    1. Does the header render correctly (logo, navigation, dark mode toggle)?
    2. Are the fonts rendering (no missing glyphs)?
    3. Is the layout centered or broken?
    4. Rate the visual appearance: Good/Degraded/Broken
    Be specific about any issues you see.
    """
    results[browser] = run_computer_use_agent(task, max_steps=5)

# Compare results
for browser, result in results.items():
    print(f"\n{browser}:")
    print(result)
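Printing the reports side by side works for two browsers; beyond that it is easier to pull out the Good/Degraded/Broken rating and flag only the outliers. A sketch under the assumption that the rating words appear verbatim and capitalized, as requested in the prompt; taking the last match is a heuristic, and both function names are hypothetical.

```python
import re
from typing import Optional

RATING = re.compile(r"\b(Good|Degraded|Broken)\b")


def extract_rating(report: str) -> Optional[str]:
    """Return the last Good/Degraded/Broken rating mentioned in a report, if any."""
    matches = RATING.findall(report)
    return matches[-1] if matches else None


def browsers_needing_attention(reports: dict[str, str]) -> list[str]:
    """List browsers whose report is not rated Good (including unparseable reports)."""
    return [name for name, text in reports.items() if extract_rating(text) != "Good"]


example_reports = {
    "google-chrome": "Header renders correctly, fonts fine. Rating: Good",
    "firefox": "Navigation overlaps the logo. Rating: Degraded",
}
print(browsers_needing_attention(example_reports))  # ['firefox']
```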
Workflow 5: Bulk data labeling via a UI labeling tool
When your labeling tool (Scale AI, Label Studio) doesn't expose the specific workflow you need via API:
task = """
Label Studio is open in the browser with the Image Classification project.
For each unlabeled image shown:
1. Look at the image
2. Identify the main subject: cat, dog, bird, other
3. Click the appropriate label button
4. Click Next to move to the following image
Continue until you've labeled 50 images or the project shows "Complete".
Count: keep track of how many of each label you've applied.
"""
result = run_computer_use_agent(task, max_steps=200)
print(result) # Summary: "Labeled 50 images: 23 cat, 15 dog, 8 bird, 4 other"
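Over 50 images the agent's own tally is a useful tripwire: if the per-label counts do not add up to the stated total, something was skipped or double-clicked. A sketch assuming the summary follows the shape shown in the comment above; `parse_label_summary` and `totals_match` are hypothetical helpers.

```python
import re


def parse_label_summary(summary: str) -> dict[str, int]:
    """Parse 'Labeled 50 images: 23 cat, 15 dog, ...' into {'cat': 23, ...}."""
    _, _, counts_part = summary.partition(":")
    return {label: int(n) for n, label in re.findall(r"(\d+)\s+(\w+)", counts_part)}


def totals_match(summary: str) -> bool:
    """Check that the per-label counts add up to the stated total."""
    total = re.search(r"Labeled\s+(\d+)", summary)
    counts = parse_label_summary(summary)
    return total is not None and int(total.group(1)) == sum(counts.values())


summary = "Labeled 50 images: 23 cat, 15 dog, 8 bird, 4 other"
print(parse_label_summary(summary))  # {'cat': 23, 'dog': 15, 'bird': 8, 'other': 4}
print(totals_match(summary))  # True
```

This only checks the agent's bookkeeping. Comparing against the counts Label Studio itself reports is the stronger test.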
Cost reality check
| Workflow | Steps | Cost | Reliability |
|---|---|---|---|
| Simple form fill | 10–15 | $1–3 | ~85% |
| UI regression test | 5–8 | $0.50–1.50 | ~90% |
| Legacy data extraction | 50–100 | $5–20 | ~70% |
| Cross-browser QA | 5–10 | $0.50–2 | ~90% |
| Bulk labeling (50 items) | 100–150 | $10–30 | ~75% |
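The Cost and Reliability columns interact: a run that fails still costs money. One way to fold them together is expected spend per successful run, assuming failed runs are simply retried from scratch and attempts are independent (both simplifications). `cost_per_success` is a hypothetical helper, fed with rows from the table above.

```python
def cost_per_success(cost_range: tuple[float, float], reliability: float) -> tuple[float, float]:
    """Expected spend per successful run, assuming independent retries until success."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    low, high = cost_range
    # With success probability p, the expected number of attempts is 1 / p
    return round(low / reliability, 2), round(high / reliability, 2)


print(cost_per_success((1.0, 3.0), 0.85))   # Simple form fill: (1.18, 3.53)
print(cost_per_success((5.0, 20.0), 0.70))  # Legacy data extraction: (7.14, 28.57)
```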
For anything that can use an API or Stagehand, those options are significantly cheaper. Computer use's value is exclusively in the "no API available" case.
Use claude-opus-4-7 for computer use tasks — the smaller models are noticeably less reliable at spatial reasoning and element targeting.



