BeautifulSoup selectors break the moment a site redesigns. XPath is brittle. Both fail completely on JavaScript-rendered pages. I've maintained scrapers that needed patching every two weeks because some product team moved a div. There's a better stack now.
Firecrawl converts any URL to clean markdown. An LLM pulls out exactly the data you want. The selector never breaks because there is no selector. The extraction instruction is in plain English, and when the page layout changes, the same instruction still works.
Why clean markdown first
Raw HTML is roughly 10× noisier than the same content as markdown. A typical product page is 80KB of HTML. Navigation, cookie banners, footer links, ad scripts, inline styles, tracking pixels — all of it is in there, and none of it is the data you want. The same content as markdown is around 6KB, with all that noise stripped.
Feeding an LLM raw HTML has two problems. First, you're burning tokens on content the model needs to ignore. Second, the structural complexity of HTML actively confuses extraction — the model has to work out what's content and what's markup, and it doesn't always get it right.
Firecrawl handles the hard parts: JavaScript rendering, waiting for dynamic content to load, stripping navigation and ads, and returning clean prose. You send a URL and get back something that looks like a well-formatted article. Then you hand that to the LLM with a specific extraction instruction.
Setup
pip install firecrawl-py openai pydantic python-dotenv
Get your Firecrawl API key at firecrawl.dev. The free tier covers 500 pages per month, enough to prototype and validate the pipeline before you commit to a paid plan. Then set your environment variables:
FIRECRAWL_API_KEY=fc-your-key
AICREDITS_API_KEY=sk-your-aicredits-key
Indian developers can access Claude, GPT-4o, and Gemini through AICredits.in, with INR billing, UPI top-up, and no international card required.
The core pipeline
Every extraction follows the same three-step pattern: fetch the page as markdown, define a Pydantic schema for what you want, ask the LLM to fill it.
import os
import json
import firecrawl
from openai import OpenAI
from pydantic import BaseModel

fc = firecrawl.FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1"
)
def extract_from_url(url: str, schema: type[BaseModel], instruction: str) -> BaseModel:
    # Step 1: fetch the page as clean markdown.
    result = fc.scrape_url(url, params={"formats": ["markdown"]})
    markdown = result["markdown"]

    # Step 2: ask the model to fill the schema.
    response = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {
                "role": "system",
                # schema_json() is the Pydantic v1 method; on v2 use
                # json.dumps(schema.model_json_schema()) instead.
                "content": f"Extract data as JSON matching this schema exactly:\n{schema.schema_json()}"
            },
            {
                "role": "user",
                "content": f"{instruction}\n\nContent:\n{markdown[:8000]}"
            }
        ],
        response_format={"type": "json_object"}
    )

    # Step 3: parse and validate. Wrong types raise a ValidationError here.
    return schema(**json.loads(response.choices[0].message.content))
The response_format={"type": "json_object"} forces the model to return valid JSON — no markdown fences, no explanation, just the object. Pydantic validates it on the way out. If the model returns a field with the wrong type, you get a validation error immediately instead of a silent data corruption issue downstream. See the structured outputs post for a deeper look at why this matters.
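To make that validation failure actionable rather than a crash, catch it at the call site. A minimal sketch, assuming url and instruction are already defined and using the PricingPage schema from the first use case below:

from pydantic import ValidationError

try:
    data = extract_from_url(url, PricingPage, instruction)
except ValidationError as e:
    # Model output didn't match the schema: log it, skip the page,
    # and optionally queue the URL for a retry with a tightened instruction.
    print(f"Validation failed for {url}: {e}")
    data = None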
The markdown[:8000] truncation is intentional. Most product pages, job listings, and articles contain all the useful information in the first 8,000 characters of cleaned markdown. If you're working with very long pages — documentation, research papers, dense reports — increase that limit, but be aware you're paying for those extra tokens.
Use case 1: competitor price monitoring
The schema does the heavy lifting. Define exactly what you want to extract, and the model figures out where on the page to find it.
class PricingPage(BaseModel):
    company_name: str
    plans: list[dict]
    last_updated: str | None = None
pricing = extract_from_url(
    url="https://competitor.com/pricing",
    schema=PricingPage,
    instruction="Extract all pricing plan details including plan names, prices, billing periods, and key features."
)
The plans field uses list[dict] here rather than a nested Pydantic model because pricing page structures vary wildly. One competitor has three tiers. Another has 12 add-ons. A rigid schema breaks on the unusual cases; a list of dicts accepts whatever the model finds.
Run this nightly with a cron job. After each run, diff the result against yesterday's output and send a Slack message when any price changes. I've had teams discover competitor pricing changes this way before the sales team did.
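A minimal sketch of that diff-and-alert step. The snapshot file and SLACK_WEBHOOK_URL variable are my own choices, and it assumes the requests package is installed:

import json
import os
import requests

def alert_on_price_change(pricing: PricingPage, snapshot_path: str = "pricing_snapshot.json"):
    current = {"company_name": pricing.company_name, "plans": pricing.plans}
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            previous = json.load(f)
        if previous["plans"] != current["plans"]:
            # Post the before/after to Slack and let a human interpret it.
            requests.post(
                os.environ["SLACK_WEBHOOK_URL"],
                json={"text": f"Pricing change at {current['company_name']}:\n"
                              f"before: {previous['plans']}\nafter: {current['plans']}"},
            )
    with open(snapshot_path, "w") as f:
        json.dump(current, f)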
Use case 2: job listing aggregator
Job boards are notorious for inconsistent structure. Some show salary, some don't. Some list required skills, some list "nice to haves." The str | None pattern handles optional fields cleanly.
class JobListing(BaseModel):
    title: str
    company: str
    location: str
    salary_range: str | None = None
    skills_required: list[str]
    posted_date: str | None = None
job = extract_from_url(
    url="https://jobs.example.com/listing/12345",
    schema=JobListing,
    instruction="Extract the job title, company, location, salary if mentioned, required skills, and posting date."
)
The instruction phrase "if mentioned" matters. Without it, the model sometimes invents a salary when none is on the page. Explicit permission to omit a field produces cleaner output than leaving it ambiguous.
Scrape 50 listings a day, store the structured results in a database, and you have a job market intelligence tool. Filter by skills_required to find which skills are appearing more or less frequently over time.
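A sketch of the storage and trend-query side using the standard-library sqlite3; the table layout here is my own:

import json
import sqlite3
from collections import Counter
from datetime import date

conn = sqlite3.connect("jobs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS listings (
    title TEXT, company TEXT, location TEXT,
    salary_range TEXT, skills TEXT, scraped_on TEXT)""")

def store_listing(job: JobListing):
    conn.execute(
        "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?)",
        (job.title, job.company, job.location, job.salary_range,
         json.dumps(job.skills_required), date.today().isoformat()),
    )
    conn.commit()

# Which skills show up most often across everything scraped so far?
rows = conn.execute("SELECT skills FROM listings").fetchall()
counts = Counter(skill for (skills,) in rows for skill in json.loads(skills))
print(counts.most_common(10))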
Use case 3: news article summarization
News extraction is where the LLM earns its place most clearly. Summarization isn't possible with traditional selectors — you need semantic understanding of the content.
class NewsArticle(BaseModel):
    headline: str
    summary: str
    key_entities: list[str]
    sentiment: str
    published_at: str | None = None
article = extract_from_url(
    url="https://news.example.com/article/abc",
    schema=NewsArticle,
    instruction="Extract headline, a 2-sentence summary, key people/companies/places mentioned, sentiment (positive/negative/neutral), and publication date."
)
The sentiment field is constrained to three values in the instruction. The model respects this. You could go further and add a Literal["positive", "negative", "neutral"] type annotation in Pydantic, which would catch any model deviation at validation time. For the research prompt library, I use exactly this pattern for processing news sources.
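The tightened variant looks like this (NewsArticleStrict is my name for it); any sentiment value outside the three literals now fails validation instead of slipping through:

from typing import Literal
from pydantic import BaseModel

class NewsArticleStrict(BaseModel):
    headline: str
    summary: str
    key_entities: list[str]
    sentiment: Literal["positive", "negative", "neutral"]
    published_at: str | None = None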
Handling failures gracefully
Three failure modes you'll hit in production, and how to handle each.
Firecrawl timeout. Dynamic pages sometimes need more time to render. Retry with exponential backoff — most timeouts resolve on the second attempt. Firecrawl's Python client has a timeout parameter; setting it to 30 seconds instead of the default 10 seconds resolves most single-page app rendering issues.
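A minimal backoff wrapper using only the standard library; the attempt count and delays are arbitrary starting points:

import time

def scrape_with_retry(url: str, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        try:
            return fc.scrape_url(url, params={"formats": ["markdown"]})
        except Exception:
            # Catching broadly keeps the sketch simple; narrow this to
            # Firecrawl's timeout exception in real code.
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** (attempt + 1))  # 2s, then 4s between attempts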
Paywalled content. Firecrawl can handle many JavaScript rendering cases, but it can't bypass authentication or paywalls. Accept this constraint upfront. Scrape what's available in the preview text, and note in your schema that full_content may be None for paywalled sources. Don't try to work around paywalls.
Missing optional fields. Use str | None in Pydantic for any field that might not be present on a given page. Never use str for optional data. The model will sometimes guess rather than omit if you force a non-optional type, and a plausible-sounding guess is worse than an explicit None.
from pydantic import BaseModel
from typing import Optional

class ProductListing(BaseModel):
    name: str
    price: str
    description: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
The = None default means even if the model omits the field entirely, Pydantic won't throw a validation error. You get None instead of an exception.
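A quick check of that behavior, with hypothetical values:

item = ProductListing(name="Widget", price="$19.99")  # optional fields omitted
print(item.rating)  # None, not a ValidationError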
Scaling to 1,000+ pages per day
The core function is synchronous. For volume work, wrap it in asyncio:
import asyncio
async def extract_batch(urls: list[str], schema: type[BaseModel], instruction: str) -> list:
    tasks = [
        asyncio.to_thread(extract_from_url, url, schema, instruction)
        for url in urls
    ]
    return await asyncio.gather(*tasks, return_exceptions=True)
return_exceptions=True means one failed URL doesn't crash the whole batch. You get a list where some items are results and some are exceptions — filter them apart and retry the exceptions.
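Splitting results from exceptions and retrying once takes a few more lines; the single retry pass here is a placeholder for whatever policy you prefer:

async def extract_all(urls: list[str], schema: type[BaseModel], instruction: str) -> list:
    results = await extract_batch(urls, schema, instruction)
    succeeded = [r for r in results if not isinstance(r, Exception)]
    failed = [u for u, r in zip(urls, results) if isinstance(r, Exception)]
    if failed:
        # One retry pass; anything that fails twice gets dropped (log it in real code).
        retried = await extract_batch(failed, schema, instruction)
        succeeded += [r for r in retried if not isinstance(r, Exception)]
    return succeeded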
The cost math for 1,000 pages per day: each page produces roughly 6KB of markdown, which is about 1,500 tokens. Add 500 tokens for the schema and instruction, and you're around 2,000 input tokens per page. At $0.003 per 1K tokens via aicredits.in, that's roughly $6 per day for input tokens. Output tokens (the extracted JSON) add maybe 20%. Total: around $7–8 per day for 1,000 structured extractions.
Compare that to the engineering time maintaining brittle selectors. The math isn't close.
For very high volume, Firecrawl's batch scrape endpoint is more efficient than individual calls — it handles rate limiting and retries internally. Check their docs for batch_scrape_urls once you're above a few hundred pages per run.
The extraction instruction is your real selector
The mental model shift worth internalizing: in this stack, the extraction instruction does the job that CSS selectors used to do. When the extraction is wrong, don't reach for the DOM — rewrite the instruction.
"Extract the price" is a bad instruction. "Extract the current price in USD, not the strikethrough original price" is a better one. "Extract the price shown below the product title, which may include a currency symbol" is better still on ambiguous pages.
Specificity in the instruction produces specificity in the output. The same iterative debugging mindset from prompt engineering for structured outputs applies here — test a few sample pages, look at where the extraction goes wrong, and tighten the instruction until it doesn't.
This approach handles the long tail of scraping problems — responsive layouts, A/B tests, personalized content, gradual site redesigns — that make traditional scrapers a maintenance burden. The page structure can change completely. If the information is still there, the instruction still finds it.