India has a surprisingly rich open data ecosystem that almost no developer is using. RBI circulars going back to 2000. The DPIIT startup registry with 100,000+ registered startups. BSE company filings. 7,000+ datasets on data.gov.in. Open FSSAI food licensing data.
Most Indian developers know this data exists but treat it as a compliance headache rather than a developer resource. That's the opportunity: almost no one has built good developer tools on top of Indian regulatory data.
This tutorial builds a Claude-powered RAG system that can answer natural language questions about RBI circulars. Ask "What did RBI say about BNPL regulations in 2024?" and get an accurate answer with the source circular number and date. The same pattern works for any of the datasets below.
The Indian open data landscape
Before writing code, here's what's actually available:
| Source | Data Available | Format | Auth Required |
|---|---|---|---|
| RBI (rbi.org.in) | Circulars, policy docs, reports, 2000–present | PDF, HTML | No |
| DPIIT Startup India | Registered startup details, funding data | JSON API | Free API key |
| data.gov.in | 7,000+ government datasets | CSV, JSON, XML | No (most datasets) |
| BSE India | Listed company filings, quarterly results | PDF, CSV | No |
| NSE India | Trading data, corporate announcements | JSON API | No |
| Open FSSAI | Food business licensing data | CSV | No |
| MCA (mca.gov.in) | Company registration data | Web/PDF | No |
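Most of these sources are a plain HTTP request away. As a quick taste before the main build, here's a minimal sketch of pulling records from data.gov.in's resource API — the resource ID below is a placeholder (each dataset page lists its own), and the API endpoint wants a free key even where the raw CSV download doesn't:

# datagov_sample.py — hedged sketch; RESOURCE_ID is a placeholder, copy a real one from a dataset page
import os
import requests

RESOURCE_ID = "your-resource-id-here"  # placeholder — every data.gov.in dataset page shows its resource ID
url = f"https://api.data.gov.in/resource/{RESOURCE_ID}"
params = {
    "api-key": os.environ["DATA_GOV_IN_API_KEY"],  # free key after registering on data.gov.in
    "format": "json",
    "limit": 10,
}
records = requests.get(url, params=params, timeout=30).json().get("records", [])
for record in records:
    print(record)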
The RBI data is particularly valuable because:
- It's comprehensive (20+ years of circulars)
- Financial regulations change frequently
- The consequences of non-compliance are serious
- The documents are hard to navigate manually
We'll build the RBI circular QA system. By the end you'll have something you can actually use to answer questions like "What's the current RBI stance on pre-payment penalties for personal loans?"
What we're building
A RAG chatbot over RBI circulars from 2022-2026. Stack:
- Python 3.10+ with LangChain
- ChromaDB for local vector storage (no external service needed)
- Claude via AICredits.in — Claude Sonnet for the QA chain, with UPI billing so you don't need an international card
- Streamlit for the web UI
- pypdf for PDF text extraction
The system answers questions by retrieving relevant circulars and synthesising an answer. It cites the circular number and date, so you can verify the source.
Install dependencies:
pip install langchain langchain-openai langchain-community \
chromadb pypdf requests streamlit beautifulsoup4
Step 1: Collect and preprocess RBI circulars
RBI publishes circulars at https://www.rbi.org.in/scripts/BS_CircularIndexDisplay.aspx. Each circular links to a PDF. We'll scrape the listing and download PDFs.
# collect_rbi.py
import requests
import time
from bs4 import BeautifulSoup
from pathlib import Path
import pypdf
import json
DATA_DIR = Path("./rbi_data")
PDF_DIR = DATA_DIR / "pdfs"
TEXT_DIR = DATA_DIR / "texts"
PDF_DIR.mkdir(parents=True, exist_ok=True)
TEXT_DIR.mkdir(parents=True, exist_ok=True)
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; ResearchBot/1.0)",
}
RBI_BASE = "https://www.rbi.org.in"
def get_circular_links(year: int) -> list[dict]:
"""Scrape circular listing page for a given year."""
url = f"{RBI_BASE}/scripts/BS_CircularIndexDisplay.aspx?year={year}"
response = requests.get(url, headers=HEADERS, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")
circulars = []
# RBI's table structure: each row has date, circular number, subject, PDF link
for row in soup.find_all("tr"):
cells = row.find_all("td")
if len(cells) < 3:
continue
pdf_link = row.find("a", href=lambda h: h and ".pdf" in h.lower())
if not pdf_link:
continue
href = pdf_link.get("href", "")
full_url = href if href.startswith("http") else f"{RBI_BASE}{href}"
        # the len(cells) >= 3 guard above makes these indexes safe
        circulars.append({
            "date": cells[0].get_text(strip=True),
            "circular_number": cells[1].get_text(strip=True),
            "subject": cells[2].get_text(strip=True),
            "pdf_url": full_url,
            "year": year,
        })
return circulars
def download_and_extract(circular: dict) -> dict | None:
"""Download PDF and extract text."""
# Sanitise filename from circular number
safe_name = circular["circular_number"].replace("/", "_").replace(" ", "_")
pdf_path = PDF_DIR / f"{safe_name}.pdf"
text_path = TEXT_DIR / f"{safe_name}.txt"
# Skip if already processed
if text_path.exists():
return None
try:
# Download PDF
response = requests.get(circular["pdf_url"], headers=HEADERS, timeout=30)
response.raise_for_status()
pdf_path.write_bytes(response.content)
# Extract text
reader = pypdf.PdfReader(str(pdf_path))
text = "\n".join(page.extract_text() or "" for page in reader.pages)
text = text.strip()
if len(text) < 100: # skip nearly-empty extractions
return None
text_path.write_text(text, encoding="utf-8")
return {
**circular,
"text_path": str(text_path),
"char_count": len(text),
}
except Exception as e:
print(f"Failed to process {circular['circular_number']}: {e}")
return None
def collect_circulars(years: list[int] | None = None):
if years is None:
years = [2022, 2023, 2024, 2025, 2026]
all_circulars = []
for year in years:
print(f"Fetching {year} circular listings...")
circulars = get_circular_links(year)
print(f" Found {len(circulars)} circulars")
all_circulars.extend(circulars)
print(f"\nDownloading {len(all_circulars)} circulars...")
processed = []
for i, circular in enumerate(all_circulars):
result = download_and_extract(circular)
if result:
processed.append(result)
print(f" [{i+1}/{len(all_circulars)}] {circular['circular_number']}")
time.sleep(0.5) # be polite to RBI's servers
# Save manifest
manifest_path = DATA_DIR / "manifest.json"
manifest_path.write_text(json.dumps(processed, indent=2))
print(f"\nDone. Processed {len(processed)} circulars. Manifest at {manifest_path}")
if __name__ == "__main__":
collect_circulars()
Run this once to collect the data:
python collect_rbi.py
It'll take 15-30 minutes to download a few years of circulars. The script skips already-downloaded files, so you can interrupt and restart safely.
Step 2: Chunk and embed the documents
Chunking strategy matters a lot for regulatory documents. RBI circulars often have numbered sections with independent meaning — a 500-character chunk that cuts mid-paragraph loses context. We'll use RecursiveCharacterTextSplitter with overlapping chunks and preserve metadata so we can cite sources.
# build_index.py
import os
import json
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
DATA_DIR = Path("./rbi_data")
CHROMA_DIR = "./rbi_chroma"
# Load manifest
manifest = json.loads((DATA_DIR / "manifest.json").read_text())
# Prepare documents with metadata
documents = []
for circular in manifest:
text_path = Path(circular["text_path"])
if not text_path.exists():
continue
text = text_path.read_text(encoding="utf-8")
# Preserve structured metadata for citation
metadata = {
"circular_number": circular["circular_number"],
"date": circular["date"],
"subject": circular["subject"],
"year": str(circular["year"]),
"source": circular["pdf_url"],
}
documents.append(Document(page_content=text, metadata=metadata))
print(f"Loaded {len(documents)} circulars")
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Create embeddings using AICredits.in endpoint
# AICredits exposes an OpenAI-compatible endpoint — use text-embedding-3-small
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=os.environ["AICREDITS_API_KEY"],
openai_api_base="https://api.aicredits.in/v1",
)
# Build ChromaDB index
print("Building ChromaDB index (this takes a few minutes for large datasets)...")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory=CHROMA_DIR,
)
vectorstore.persist()
print(f"Index saved to {CHROMA_DIR}")
For ~500 circulars, this typically takes 5-10 minutes and costs around ₹15-25 (text-embedding-3-small is very cheap). The ChromaDB index is saved locally — you only need to build it once.
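Before wiring up the QA chain, it's worth sanity-checking the index with a raw similarity search — a minimal sketch that reuses the same embedding config (the query is just an example):

# check_index.py — quick sanity check on the persisted index
import os
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.environ["AICREDITS_API_KEY"],
    openai_api_base="https://api.aicredits.in/v1",
)
vectorstore = Chroma(persist_directory="./rbi_chroma", embedding_function=embeddings)

for doc in vectorstore.similarity_search("guidelines on digital lending", k=3):
    print(doc.metadata.get("circular_number"), "—", doc.page_content[:120])

If the top hits look unrelated to the query, check the extracted text files before blaming the retriever — bad PDF extraction is the usual culprit.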
Step 3: Build the Claude-powered QA system
# qa_system.py
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
CHROMA_DIR = "./rbi_chroma"
# Load the existing ChromaDB index
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=os.environ["AICREDITS_API_KEY"],
openai_api_base="https://api.aicredits.in/v1",
)
vectorstore = Chroma(
persist_directory=CHROMA_DIR,
embedding_function=embeddings,
)
# Claude Sonnet via AICredits.in
llm = ChatOpenAI(
model="anthropic/claude-sonnet-4-6",
openai_api_key=os.environ["AICREDITS_API_KEY"],
openai_api_base="https://api.aicredits.in/v1",
temperature=0, # deterministic for regulatory Q&A
)
# Custom prompt that keeps Claude grounded in the source material
PROMPT_TEMPLATE = """You are a regulatory research assistant specialising in RBI (Reserve Bank of India) circulars and policy documents.
Answer the question based ONLY on the provided RBI circular excerpts. Do not use any knowledge outside these excerpts.
If the answer is not found in the provided circulars, say: "I could not find specific guidance on this in the available circulars. You should check the RBI website directly at rbi.org.in for current guidance."
Always cite the circular number and date when you provide information.
Context from RBI circulars:
{context}
Question: {question}
Answer (cite circular numbers and dates):"""
PROMPT = PromptTemplate(
template=PROMPT_TEMPLATE,
input_variables=["context", "question"],
)
# Build the QA chain
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance — reduces redundant results
search_kwargs={"k": 5}, # retrieve 5 most relevant chunks
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT},
)
def ask_rbi(question: str) -> dict:
"""Query the RBI circular database."""
result = qa_chain.invoke({"query": question})
# Extract unique source circulars for display
sources = []
seen = set()
for doc in result["source_documents"]:
circular_num = doc.metadata.get("circular_number", "Unknown")
if circular_num not in seen:
seen.add(circular_num)
sources.append({
"circular": circular_num,
"date": doc.metadata.get("date", ""),
"subject": doc.metadata.get("subject", ""),
})
return {
"answer": result["result"],
"sources": sources,
}
if __name__ == "__main__":
# Quick test
result = ask_rbi("What are RBI's current regulations on digital lending apps?")
print(result["answer"])
print("\nSources used:")
for s in result["sources"]:
print(f" - {s['circular']} ({s['date']}): {s['subject']}")
The key choices here:
- `temperature=0` — regulatory Q&A benefits from determinism. You don't want creative answers about banking regulations.
- MMR retrieval — prevents the chain from retrieving 5 chunks from the same circular when multiple circulars are relevant.
- The system prompt explicitly restricts Claude to the provided context. Without this, Claude will confidently answer from training data rather than citing the actual circulars.
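To see what MMR actually buys you, compare it against plain similarity search on the same question — a quick sketch that assumes you can import vectorstore from qa_system.py (the example question is arbitrary):

# compare_retrieval.py — eyeball MMR vs plain similarity on one query
from qa_system import vectorstore

question = "What are the rules on pre-payment penalties for personal loans?"
for search_type in ("similarity", "mmr"):
    retriever = vectorstore.as_retriever(search_type=search_type, search_kwargs={"k": 5})
    docs = retriever.invoke(question)
    print(search_type, "->", [d.metadata.get("circular_number") for d in docs])

Plain similarity often returns several chunks from one long circular; MMR trades a little raw relevance for coverage across circulars.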
Step 4: A simple web UI with Streamlit
# app.py
import streamlit as st
from qa_system import ask_rbi
st.set_page_config(
page_title="RBI Circular Search",
page_icon="🏦",
layout="wide",
)
st.title("RBI Circular Q&A")
st.markdown("Ask questions about RBI regulations. Answers are grounded in RBI circulars from 2022-2026.")
query = st.text_input(
"Your question",
placeholder="What did RBI say about BNPL regulations in 2024?",
)
if st.button("Search", type="primary") and query:
with st.spinner("Searching RBI circulars..."):
result = ask_rbi(query)
st.markdown("### Answer")
st.write(result["answer"])
if result["sources"]:
st.markdown("### Source circulars")
for source in result["sources"]:
with st.expander(f"{source['circular']} — {source['date']}"):
st.write(f"**Subject:** {source['subject']}")
Run it:
AICREDITS_API_KEY=your_key streamlit run app.py
Extending this to other datasets
DPIIT startup data
The DPIIT Startup India portal has a free API for registered startup data. Get an API key at startupindia.gov.in.
# Add to your MCP server or run as a standalone script
import os
import requests

def get_startup_stats(sector: str | None = None, state: str | None = None) -> dict:
"""Query DPIIT startup registry."""
DPIIT_API = "https://api.startupindia.gov.in/sih/api/search/profiles/startup"
params = {
"startupSector": sector,
"state": state,
"pageNo": 0,
"pageSize": 20,
}
headers = {
"Authorization": f"Bearer {os.environ['DPIIT_API_KEY']}",
"Content-Type": "application/json",
}
    response = requests.get(DPIIT_API, params=params, headers=headers, timeout=30)
return response.json()
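A quick usage sketch — the sector and state values here are guesses at what the API accepts, and the response schema isn't documented in this tutorial, so inspect the keys before building on them:

# usage sketch — parameter values and response fields are assumptions; check the portal docs
result = get_startup_stats(sector="Fintech", state="Maharashtra")
print(list(result.keys()))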
Wire this into a chatbot and a question like "How many fintech startups are registered in Maharashtra?" gets answered precisely from the registry rather than from a model's guess.
BSE company filings
BSE publishes quarterly results, annual reports, and corporate announcements as PDFs. The filing index is at https://www.bseindia.com/corporates/Comp_Resultsnew.aspx.
Apply the same RAG pattern: download PDFs, extract text, embed, build a QA chain. You end up with a system that can answer "What was Infosys' EBITDA margin in Q3 FY26?" from actual filings rather than relying on model training data.
The difference from the RBI system: financial documents have tables that don't extract cleanly from PDFs. You'll want to add a table extraction step (use camelot-py or pdfplumber for PDFs with tables rather than pypdf).
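If you go the pdfplumber route, the table step looks roughly like this — a minimal sketch, with "filing.pdf" as a placeholder path. Flattening each table to pipe-delimited text keeps rows intact when you later chunk and embed:

# extract_tables.py — sketch of table extraction with pdfplumber; "filing.pdf" is a placeholder
import pdfplumber

def extract_tables_as_text(pdf_path: str) -> list[str]:
    """Pull each detected table out of a filing PDF as pipe-delimited text."""
    tables_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                # each table is a list of rows; cells can be None for merged or empty cells
                rows = [" | ".join(cell or "" for cell in row) for row in table]
                tables_text.append(f"[Table, page {page_num}]\n" + "\n".join(rows))
    return tables_text

if __name__ == "__main__":
    for t in extract_tables_as_text("filing.pdf"):
        print(t, "\n")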
Try it now with AICredits.in
Access Claude, GPT-4o, Gemini, and 300+ models with UPI payment in ₹. No international card needed. Create free account →
Next steps
- RAG lesson — the conceptual foundation for everything we built here: embeddings, vector search, retrieval augmentation
- How RAG works — deeper technical dive into the retrieval mechanism
- Build a Python AI app with LangChain — full stack Python AI application tutorial
- Build a customer support AI agent — applying similar techniques to a different use case



