BeautifulSoup selectors break the moment a site redesigns. XPath is brittle. Both fail completely on JavaScript-rendered pages. I've maintained scrapers that needed patching every two weeks because some product team moved a div. There's a better stack now.
Firecrawl converts any URL to clean markdown. An LLM pulls out exactly the data you want. The selector never breaks because there is no selector. The extraction instruction is in plain English, and when the page layout changes, the same instruction still works.
Why clean markdown first
Raw HTML is roughly 10× noisier than the same content as markdown. A typical product page is 80KB of HTML. Navigation, cookie banners, footer links, ad scripts, inline styles, tracking pixels — all of it is in there, and none of it is the data you want. The same content as markdown is around 6KB, with all that noise stripped.
Feeding an LLM raw HTML has two problems. First, you're burning tokens on content the model needs to ignore. Second, the structural complexity of HTML actively confuses extraction — the model has to work out what's content and what's markup, and it doesn't always get it right.
Firecrawl handles the hard parts: JavaScript rendering, waiting for dynamic content to load, stripping navigation and ads, and returning clean prose. You send a URL and get back something that looks like a well-formatted article. Then you hand that to the LLM with a specific extraction instruction.
Setup
pip install firecrawl-py openai pydantic python-dotenv
Get your Firecrawl API key at firecrawl.dev. The free tier covers 500 pages per month, enough to prototype and validate the pipeline before you commit to a paid plan. Then set your environment variables:
FIRECRAWL_API_KEY=fc-your-key
AICREDITS_API_KEY=sk-your-aicredits-key
Indian developers can access Claude, GPT-4o, and Gemini through AICredits.in, with INR billing, UPI top-up, and no international card required.
The core pipeline
Every extraction follows the same three-step pattern: fetch the page as markdown, define a Pydantic schema for what you want, ask the LLM to fill it.
import os
import json
import firecrawl
from openai import OpenAI
from pydantic import BaseModel

fc = firecrawl.FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1"
)
def extract_from_url(url: str, schema: type[BaseModel], instruction: str) -> BaseModel:
    # Step 1: fetch the page as clean markdown.
    result = fc.scrape_url(url, params={"formats": ["markdown"]})
    markdown = result["markdown"]

    # Step 2: ask the model to fill the schema.
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {
                "role": "system",
                # schema_json() is the Pydantic v1 method; on v2 use
                # json.dumps(schema.model_json_schema()) instead.
                "content": f"Extract data as JSON matching this schema exactly:\n{schema.schema_json()}"
            },
            {
                "role": "user",
                "content": f"{instruction}\n\nContent:\n{markdown[:8000]}"
            }
        ],
        response_format={"type": "json_object"}
    )

    # Step 3: parse and validate. Wrong types raise a ValidationError here.
    return schema(**json.loads(response.choices[0].message.content))
The response_format={"type": "json_object"} forces the model to return valid JSON — no markdown fences, no explanation, just the object. Pydantic validates it on the way out. If the model returns a field with the wrong type, you get a validation error immediately instead of a silent data corruption issue downstream. See the structured outputs post for a deeper look at why this matters.
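To make that validation failure actionable rather than a crash, catch it at the call site. A minimal sketch, assuming url and instruction are already defined and using the PricingPage schema from the first use case below:

from pydantic import ValidationError

try:
    data = extract_from_url(url, PricingPage, instruction)
except ValidationError as e:
    # Model output didn't match the schema: log it, skip the page,
    # and optionally queue the URL for a retry with a tightened instruction.
    print(f"Validation failed for {url}: {e}")
    data = None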
The markdown[:8000] truncation is intentional. Most product pages, job listings, and articles contain all the useful information in the first 8,000 characters of cleaned markdown. If you're working with very long pages — documentation, research papers, dense reports — increase that limit, but be aware you're paying for those extra tokens.
Use case 1: competitor price monitoring
The schema does the heavy lifting. Define exactly what you want to extract, and the model figures out where on the page to find it.
class PricingPage(BaseModel):
    company_name: str
    plans: list[dict]
    last_updated: str | None = None
pricing = extract_from_url(
    url="https://competitor.com/pricing",
    schema=PricingPage,
    instruction="Extract all pricing plan details including plan names, prices, billing periods, and key features."
)
The plans field uses list[dict] here rather than a nested Pydantic model because pricing page structures vary wildly. One competitor has three tiers. Another has 12 add-ons. A rigid schema breaks on the unusual cases; a list of dicts accepts whatever the model finds.
Run this nightly with a cron job. After each run, diff the result against yesterday's output and send a Slack message when any price changes. I've had teams discover competitor pricing changes this way before the sales team did.
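A minimal sketch of that diff-and-alert step. The snapshot file and SLACK_WEBHOOK_URL variable are my own choices, and it assumes the requests package is installed:

import json
import os
import requests

def alert_on_price_change(pricing: PricingPage, snapshot_path: str = "pricing_snapshot.json"):
    current = {"company_name": pricing.company_name, "plans": pricing.plans}
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            previous = json.load(f)
        if previous["plans"] != current["plans"]:
            # Post the before/after to Slack and let a human interpret it.
            requests.post(
                os.environ["SLACK_WEBHOOK_URL"],
                json={"text": f"Pricing change at {current['company_name']}:\n"
                              f"before: {previous['plans']}\nafter: {current['plans']}"},
            )
    with open(snapshot_path, "w") as f:
        json.dump(current, f)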
Use case 2: job listing aggregator
Job boards are notorious for inconsistent structure. Some show salary, some don't. Some list required skills, some list "nice to haves." The str | None pattern handles optional fields cleanly.
class JobListing(BaseModel):
    title: str
    company: str
    location: str
    salary_range: str | None = None
    skills_required: list[str]
    posted_date: str | None = None
job = extract_from_url(
    url="https://jobs.example.com/listing/12345",
    schema=JobListing,
    instruction="Extract the job title, company, location, salary if mentioned, required skills, and posting date."
)
The instruction phrase "if mentioned" matters. Without it, the model sometimes invents a salary when none is on the page. Explicit permission to omit a field produces cleaner output than leaving it ambiguous.
Scrape 50 listings a day, store the structured results in a database, and you have a job market intelligence tool. Filter by skills_required to find which skills are appearing more or less frequently over time.
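A sketch of the storage and trend-query side using the standard-library sqlite3; the table layout here is my own:

import json
import sqlite3
from collections import Counter
from datetime import date

conn = sqlite3.connect("jobs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS listings (
    title TEXT, company TEXT, location TEXT,
    salary_range TEXT, skills TEXT, scraped_on TEXT)""")

def store_listing(job: JobListing):
    conn.execute(
        "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?)",
        (job.title, job.company, job.location, job.salary_range,
         json.dumps(job.skills_required), date.today().isoformat()),
    )
    conn.commit()

# Which skills show up most often across everything scraped so far?
rows = conn.execute("SELECT skills FROM listings").fetchall()
counts = Counter(skill for (skills,) in rows for skill in json.loads(skills))
print(counts.most_common(10))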
Use case 3: news article summarization
News extraction is where the LLM earns its place most clearly. Summarization isn't possible with traditional selectors — you need semantic understanding of the content.
class NewsArticle(BaseModel):
    headline: str
    summary: str
    key_entities: list[str]
    sentiment: str
    published_at: str | None = None
article = extract_from_url(
    url="https://news.example.com/article/abc",
    schema=NewsArticle,
    instruction="Extract headline, a 2-sentence summary, key people/companies/places mentioned, sentiment (positive/negative/neutral), and publication date."
)
The sentiment field is constrained to three values in the instruction. The model respects this. You could go further and add a Literal["positive", "negative", "neutral"] type annotation in Pydantic, which would catch any model deviation at validation time. For the research prompt library, I use exactly this pattern for processing news sources.
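The tightened variant looks like this (NewsArticleStrict is my name for it); any sentiment value outside the three literals now fails validation instead of slipping through:

from typing import Literal
from pydantic import BaseModel

class NewsArticleStrict(BaseModel):
    headline: str
    summary: str
    key_entities: list[str]
    sentiment: Literal["positive", "negative", "neutral"]
    published_at: str | None = None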
Handling failures gracefully
Three failure modes you'll hit in production, and how to handle each.
Firecrawl timeout. Dynamic pages sometimes need more time to render. Retry with exponential backoff — most timeouts resolve on the second attempt. Firecrawl's Python client has a timeout parameter; setting it to 30 seconds instead of the default 10 seconds resolves most single-page app rendering issues.
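A minimal backoff wrapper using only the standard library; the attempt count and delays are arbitrary starting points:

import time

def scrape_with_retry(url: str, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        try:
            return fc.scrape_url(url, params={"formats": ["markdown"]})
        except Exception:
            # Catching broadly keeps the sketch simple; narrow this to
            # Firecrawl's timeout exception in real code.
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** (attempt + 1))  # 2s, then 4s between attempts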
Paywalled content. Firecrawl can handle many JavaScript rendering cases, but it can't bypass authentication or paywalls. Accept this constraint upfront. Scrape what's available in the preview text, and note in your schema that full_content may be None for paywalled sources. Don't try to work around paywalls.
Missing optional fields. Use str | None in Pydantic for any field that might not be present on a given page. Never use str for optional data. The model will sometimes guess rather than omit if you force a non-optional type, and a plausible-sounding guess is worse than an explicit None.
from pydantic import BaseModel
from typing import Optional

class ProductListing(BaseModel):
    name: str
    price: str
    description: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
The = None default means even if the model omits the field entirely, Pydantic won't throw a validation error. You get None instead of an exception.
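A quick check of that behavior, with hypothetical values:

item = ProductListing(name="Widget", price="$19.99")  # optional fields omitted
print(item.rating)  # None, not a ValidationError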
Scaling to 1,000+ pages per day
The core function is synchronous. For volume work, wrap it in asyncio:
import asyncio
async def extract_batch(urls: list[str], schema: type[BaseModel], instruction: str) -> list:
    tasks = [
        asyncio.to_thread(extract_from_url, url, schema, instruction)
        for url in urls
    ]
    return await asyncio.gather(*tasks, return_exceptions=True)
return_exceptions=True means one failed URL doesn't crash the whole batch. You get a list where some items are results and some are exceptions — filter them apart and retry the exceptions.
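Splitting results from exceptions and retrying once takes a few more lines; the single retry pass here is a placeholder for whatever policy you prefer:

async def extract_all(urls: list[str], schema: type[BaseModel], instruction: str) -> list:
    results = await extract_batch(urls, schema, instruction)
    succeeded = [r for r in results if not isinstance(r, Exception)]
    failed = [u for u, r in zip(urls, results) if isinstance(r, Exception)]
    if failed:
        # One retry pass; anything that fails twice gets dropped (log it in real code).
        retried = await extract_batch(failed, schema, instruction)
        succeeded += [r for r in retried if not isinstance(r, Exception)]
    return succeeded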
The cost math for 1,000 pages per day: each page produces roughly 6KB of markdown, which is about 1,500 tokens. Add 500 tokens for the schema and instruction, and you're around 2,000 input tokens per page. At $0.003 per 1K tokens via aicredits.in, that's roughly $6 per day for input tokens. Output tokens (the extracted JSON) add maybe 20%. Total: around $7–8 per day for 1,000 structured extractions.
Compare that to the engineering time maintaining brittle selectors. The math isn't close.
For very high volume, Firecrawl's batch scrape endpoint is more efficient than individual calls — it handles rate limiting and retries internally. Check their docs for batch_scrape_urls once you're above a few hundred pages per run.
The extraction instruction is your real selector
The mental model shift worth internalizing: in this stack, the extraction instruction does the job that CSS selectors used to do. When the extraction is wrong, don't reach for the DOM — rewrite the instruction.
"Extract the price" is a bad instruction. "Extract the current price in USD, not the strikethrough original price" is a better one. "Extract the price shown below the product title, which may include a currency symbol" is better still on ambiguous pages.
Specificity in the instruction produces specificity in the output. The same iterative debugging mindset from prompt engineering for structured outputs applies here — test a few sample pages, look at where the extraction goes wrong, and tighten the instruction until it doesn't.
This approach handles the long tail of scraping problems — responsive layouts, A/B tests, personalized content, gradual site redesigns — that make traditional scrapers a maintenance burden. The page structure can change completely. If the information is still there, the instruction still finds it.