India has a surprisingly rich open data ecosystem that almost no developer is using. RBI circulars going back to 2000. The DPIIT startup registry with 100,000+ registered startups. BSE company filings. 7,000+ datasets on data.gov.in. Open FSSAI food licensing data.
Most Indian developers know this data exists but treat it as a compliance headache rather than a developer resource. That's the opportunity: almost no one has built good developer tools on top of Indian regulatory data.
This tutorial builds a Claude-powered RAG system that can answer natural language questions about RBI circulars. Ask "What did RBI say about BNPL regulations in 2024?" and get an accurate answer with the source circular number and date. The same pattern works for any of the datasets below.
The Indian open data landscape
Before writing code, here's what's actually available:
| Source | Data Available | Format | Auth Required |
|---|---|---|---|
| RBI (rbi.org.in) | Circulars, policy docs, reports, 2000–present | PDF, HTML | No |
| DPIIT Startup India | Registered startup details, funding data | JSON API | Free API key |
| data.gov.in | 7,000+ government datasets | CSV, JSON, XML | No (most datasets) |
| BSE India | Listed company filings, quarterly results | PDF, CSV | No |
| NSE India | Trading data, corporate announcements | JSON API | No |
| Open FSSAI | Food business licensing data | CSV | No |
| MCA (mca.gov.in) | Company registration data | Web/PDF | No |
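Most of these sources are a plain HTTP request away. As a quick taste before the main build, here's a minimal sketch of pulling records from data.gov.in's resource API — the resource ID below is a placeholder (each dataset page lists its own), and the API endpoint wants a free key even where the raw CSV download doesn't:

# datagov_sample.py — hedged sketch; RESOURCE_ID is a placeholder, copy a real one from a dataset page
import os
import requests

RESOURCE_ID = "your-resource-id-here"  # placeholder — every data.gov.in dataset page shows its resource ID
url = f"https://api.data.gov.in/resource/{RESOURCE_ID}"
params = {
    "api-key": os.environ["DATA_GOV_IN_API_KEY"],  # free key after registering on data.gov.in
    "format": "json",
    "limit": 10,
}
records = requests.get(url, params=params, timeout=30).json().get("records", [])
for record in records:
    print(record)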
The RBI data is particularly valuable because:
- It's comprehensive (20+ years of circulars)
- Financial regulations change frequently
- The consequences of non-compliance are serious
- The documents are hard to navigate manually
We'll build the RBI circular QA system. By the end you'll have something you can actually use to answer questions like "What's the current RBI stance on pre-payment penalties for personal loans?"
What we're building
A RAG chatbot over RBI circulars from 2022-2026. Stack:
- Python 3.10+ with LangChain
- ChromaDB for local vector storage (no external service needed)
- Claude via AICredits.in — Claude Sonnet for the QA chain, with UPI billing so you don't need an international card
- Streamlit for the web UI
- pypdf for PDF text extraction
The system answers questions by retrieving relevant circulars and synthesising an answer. It cites the circular number and date, so you can verify the source.
Install dependencies:
pip install langchain langchain-openai langchain-community \
chromadb pypdf requests streamlit beautifulsoup4
Step 1: Collect and preprocess RBI circulars
RBI publishes circulars at https://www.rbi.org.in/scripts/BS_CircularIndexDisplay.aspx. Each circular links to a PDF. We'll scrape the listing and download PDFs.
# collect_rbi.py
import requests
import time
from bs4 import BeautifulSoup
from pathlib import Path
import pypdf
import json
DATA_DIR = Path("./rbi_data")
PDF_DIR = DATA_DIR / "pdfs"
TEXT_DIR = DATA_DIR / "texts"
PDF_DIR.mkdir(parents=True, exist_ok=True)
TEXT_DIR.mkdir(parents=True, exist_ok=True)
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; ResearchBot/1.0)",
}
RBI_BASE = "https://www.rbi.org.in"
def get_circular_links(year: int) -> list[dict]:
"""Scrape circular listing page for a given year."""
url = f"{RBI_BASE}/scripts/BS_CircularIndexDisplay.aspx?year={year}"
response = requests.get(url, headers=HEADERS, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")
circulars = []
# RBI's table structure: each row has date, circular number, subject, PDF link
for row in soup.find_all("tr"):
cells = row.find_all("td")
if len(cells) < 3:
continue
pdf_link = row.find("a", href=lambda h: h and ".pdf" in h.lower())
if not pdf_link:
continue
href = pdf_link.get("href", "")
full_url = href if href.startswith("http") else f"{RBI_BASE}{href}"
        # the len(cells) >= 3 guard above makes these indexes safe
        circulars.append({
            "date": cells[0].get_text(strip=True),
            "circular_number": cells[1].get_text(strip=True),
            "subject": cells[2].get_text(strip=True),
            "pdf_url": full_url,
            "year": year,
        })
return circulars
def download_and_extract(circular: dict) -> dict | None:
"""Download PDF and extract text."""
# Sanitise filename from circular number
safe_name = circular["circular_number"].replace("/", "_").replace(" ", "_")
pdf_path = PDF_DIR / f"{safe_name}.pdf"
text_path = TEXT_DIR / f"{safe_name}.txt"
# Skip if already processed
if text_path.exists():
return None
try:
# Download PDF
response = requests.get(circular["pdf_url"], headers=HEADERS, timeout=30)
response.raise_for_status()
pdf_path.write_bytes(response.content)
# Extract text
reader = pypdf.PdfReader(str(pdf_path))
text = "\n".join(page.extract_text() or "" for page in reader.pages)
text = text.strip()
if len(text) < 100: # skip nearly-empty extractions
return None
text_path.write_text(text, encoding="utf-8")
return {
**circular,
"text_path": str(text_path),
"char_count": len(text),
}
except Exception as e:
print(f"Failed to process {circular['circular_number']}: {e}")
return None
def collect_circulars(years: list[int] | None = None):
if years is None:
years = [2022, 2023, 2024, 2025, 2026]
all_circulars = []
for year in years:
print(f"Fetching {year} circular listings...")
circulars = get_circular_links(year)
print(f" Found {len(circulars)} circulars")
all_circulars.extend(circulars)
print(f"\nDownloading {len(all_circulars)} circulars...")
processed = []
for i, circular in enumerate(all_circulars):
result = download_and_extract(circular)
if result:
processed.append(result)
print(f" [{i+1}/{len(all_circulars)}] {circular['circular_number']}")
time.sleep(0.5) # be polite to RBI's servers
# Save manifest
manifest_path = DATA_DIR / "manifest.json"
manifest_path.write_text(json.dumps(processed, indent=2))
print(f"\nDone. Processed {len(processed)} circulars. Manifest at {manifest_path}")
if __name__ == "__main__":
collect_circulars()
Run this once to collect the data:
python collect_rbi.py
It'll take 15-30 minutes to download a few years of circulars. The script skips already-downloaded files, so you can interrupt and restart safely.
Step 2: Chunk and embed the documents
Chunking strategy matters a lot for regulatory documents. RBI circulars often have numbered sections with independent meaning — a 500-character chunk that cuts mid-paragraph loses context. We'll use RecursiveCharacterTextSplitter with overlapping chunks and preserve metadata so we can cite sources.
# build_index.py
import os
import json
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
DATA_DIR = Path("./rbi_data")
CHROMA_DIR = "./rbi_chroma"
# Load manifest
manifest = json.loads((DATA_DIR / "manifest.json").read_text())
# Prepare documents with metadata
documents = []
for circular in manifest:
text_path = Path(circular["text_path"])
if not text_path.exists():
continue
text = text_path.read_text(encoding="utf-8")
# Preserve structured metadata for citation
metadata = {
"circular_number": circular["circular_number"],
"date": circular["date"],
"subject": circular["subject"],
"year": str(circular["year"]),
"source": circular["pdf_url"],
}
documents.append(Document(page_content=text, metadata=metadata))
print(f"Loaded {len(documents)} circulars")
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Create embeddings using AICredits.in endpoint
# AICredits exposes an OpenAI-compatible endpoint — use text-embedding-3-small
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=os.environ["AICREDITS_API_KEY"],
openai_api_base="https://api.aicredits.in/v1",
)
# Build ChromaDB index
print("Building ChromaDB index (this takes a few minutes for large datasets)...")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory=CHROMA_DIR,
)
vectorstore.persist()
print(f"Index saved to {CHROMA_DIR}")
For ~500 circulars, this typically takes 5-10 minutes and costs around ₹15-25 (text-embedding-3-small is very cheap). The ChromaDB index is saved locally — you only need to build it once.
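Before wiring up the QA chain, it's worth sanity-checking the index with a raw similarity search — a minimal sketch that reuses the same embedding config (the query is just an example):

# check_index.py — quick sanity check on the persisted index
import os
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.environ["AICREDITS_API_KEY"],
    openai_api_base="https://api.aicredits.in/v1",
)
vectorstore = Chroma(persist_directory="./rbi_chroma", embedding_function=embeddings)

for doc in vectorstore.similarity_search("guidelines on digital lending", k=3):
    print(doc.metadata.get("circular_number"), "—", doc.page_content[:120])

If the top hits look unrelated to the query, check the extracted text files before blaming the retriever — bad PDF extraction is the usual culprit.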
Step 3: Build the Claude-powered QA system
# qa_system.py
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
CHROMA_DIR = "./rbi_chroma"
# Load the existing ChromaDB index
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=os.environ["AICREDITS_API_KEY"],
openai_api_base="https://api.aicredits.in/v1",
)
vectorstore = Chroma(
persist_directory=CHROMA_DIR,
embedding_function=embeddings,
)
# Claude Sonnet via AICredits.in
llm = ChatOpenAI(
model="anthropic/claude-sonnet-4-6",
openai_api_key=os.environ["AICREDITS_API_KEY"],
openai_api_base="https://api.aicredits.in/v1",
temperature=0, # deterministic for regulatory Q&A
)
# Custom prompt that keeps Claude grounded in the source material
PROMPT_TEMPLATE = """You are a regulatory research assistant specialising in RBI (Reserve Bank of India) circulars and policy documents.
Answer the question based ONLY on the provided RBI circular excerpts. Do not use any knowledge outside these excerpts.
If the answer is not found in the provided circulars, say: "I could not find specific guidance on this in the available circulars. You should check the RBI website directly at rbi.org.in for current guidance."
Always cite the circular number and date when you provide information.
Context from RBI circulars:
{context}
Question: {question}
Answer (cite circular numbers and dates):"""
PROMPT = PromptTemplate(
template=PROMPT_TEMPLATE,
input_variables=["context", "question"],
)
# Build the QA chain
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance — reduces redundant results
search_kwargs={"k": 5}, # retrieve 5 most relevant chunks
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT},
)
def ask_rbi(question: str) -> dict:
"""Query the RBI circular database."""
result = qa_chain.invoke({"query": question})
# Extract unique source circulars for display
sources = []
seen = set()
for doc in result["source_documents"]:
circular_num = doc.metadata.get("circular_number", "Unknown")
if circular_num not in seen:
seen.add(circular_num)
sources.append({
"circular": circular_num,
"date": doc.metadata.get("date", ""),
"subject": doc.metadata.get("subject", ""),
})
return {
"answer": result["result"],
"sources": sources,
}
if __name__ == "__main__":
# Quick test
result = ask_rbi("What are RBI's current regulations on digital lending apps?")
print(result["answer"])
print("\nSources used:")
for s in result["sources"]:
print(f" - {s['circular']} ({s['date']}): {s['subject']}")
The key choices here:
- `temperature=0` — regulatory Q&A benefits from determinism. You don't want creative answers about banking regulations.
- MMR retrieval — prevents the chain from retrieving 5 chunks from the same circular when multiple circulars are relevant.
- The system prompt explicitly restricts Claude to the provided context. Without this, Claude will confidently answer from training data rather than citing the actual circulars.
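To see what MMR actually buys you, compare it against plain similarity search on the same question — a quick sketch that assumes you can import vectorstore from qa_system.py (the example question is arbitrary):

# compare_retrieval.py — eyeball MMR vs plain similarity on one query
from qa_system import vectorstore

question = "What are the rules on pre-payment penalties for personal loans?"
for search_type in ("similarity", "mmr"):
    retriever = vectorstore.as_retriever(search_type=search_type, search_kwargs={"k": 5})
    docs = retriever.invoke(question)
    print(search_type, "->", [d.metadata.get("circular_number") for d in docs])

Plain similarity often returns several chunks from one long circular; MMR trades a little raw relevance for coverage across circulars.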
Step 4: A simple web UI with Streamlit
# app.py
import streamlit as st
from qa_system import ask_rbi
st.set_page_config(
page_title="RBI Circular Search",
page_icon="🏦",
layout="wide",
)
st.title("RBI Circular Q&A")
st.markdown("Ask questions about RBI regulations. Answers are grounded in RBI circulars from 2022-2026.")
query = st.text_input(
"Your question",
placeholder="What did RBI say about BNPL regulations in 2024?",
)
if st.button("Search", type="primary") and query:
with st.spinner("Searching RBI circulars..."):
result = ask_rbi(query)
st.markdown("### Answer")
st.write(result["answer"])
if result["sources"]:
st.markdown("### Source circulars")
for source in result["sources"]:
with st.expander(f"{source['circular']} — {source['date']}"):
st.write(f"**Subject:** {source['subject']}")
Run it:
AICREDITS_API_KEY=your_key streamlit run app.py
Extending this to other datasets
DPIIT startup data
The DPIIT Startup India portal has a free API for registered startup data. Get an API key at startupindia.gov.in.
# Add to your MCP server or run as a standalone script
import os
import requests

def get_startup_stats(sector: str | None = None, state: str | None = None) -> dict:
"""Query DPIIT startup registry."""
DPIIT_API = "https://api.startupindia.gov.in/sih/api/search/profiles/startup"
params = {
"startupSector": sector,
"state": state,
"pageNo": 0,
"pageSize": 20,
}
headers = {
"Authorization": f"Bearer {os.environ['DPIIT_API_KEY']}",
"Content-Type": "application/json",
}
    response = requests.get(DPIIT_API, params=params, headers=headers, timeout=30)
return response.json()
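A quick usage sketch — the sector and state values here are guesses at what the API accepts, and the response schema isn't documented in this tutorial, so inspect the keys before building on them:

# usage sketch — parameter values and response fields are assumptions; check the portal docs
result = get_startup_stats(sector="Fintech", state="Maharashtra")
print(list(result.keys()))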
Wire this into a chatbot and a question like "How many fintech startups are registered in Maharashtra?" gets answered precisely from the registry rather than from a model's guess.
BSE company filings
BSE publishes quarterly results, annual reports, and corporate announcements as PDFs. The filing index is at https://www.bseindia.com/corporates/Comp_Resultsnew.aspx.
Apply the same RAG pattern: download PDFs, extract text, embed, build a QA chain. You end up with a system that can answer "What was Infosys' EBITDA margin in Q3 FY26?" from actual filings rather than relying on model training data.
The difference from the RBI system: financial documents have tables that don't extract cleanly from PDFs. You'll want to add a table extraction step (use camelot-py or pdfplumber for PDFs with tables rather than pypdf).
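If you go the pdfplumber route, the table step looks roughly like this — a minimal sketch, with "filing.pdf" as a placeholder path. Flattening each table to pipe-delimited text keeps rows intact when you later chunk and embed:

# extract_tables.py — sketch of table extraction with pdfplumber; "filing.pdf" is a placeholder
import pdfplumber

def extract_tables_as_text(pdf_path: str) -> list[str]:
    """Pull each detected table out of a filing PDF as pipe-delimited text."""
    tables_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                # each table is a list of rows; cells can be None for merged or empty cells
                rows = [" | ".join(cell or "" for cell in row) for row in table]
                tables_text.append(f"[Table, page {page_num}]\n" + "\n".join(rows))
    return tables_text

if __name__ == "__main__":
    for t in extract_tables_as_text("filing.pdf"):
        print(t, "\n")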
Try it now with AICredits.in
Access Claude, GPT-4o, Gemini, and 300+ models with UPI payment in ₹. No international card needed. Create free account →
Next steps
- RAG lesson — the conceptual foundation for everything we built here: embeddings, vector search, retrieval augmentation
- How RAG works — deeper technical dive into the retrieval mechanism
- Build a Python AI app with LangChain — full stack Python AI application tutorial
- Build a customer support AI agent — applying similar techniques to a different use case



