The vector store is the infrastructure decision in a RAG system that's hardest to change after the fact. Switching embedding models is annoying. Switching vector stores after you've written all your ingestion code, metadata schemas, and query logic — that's a significant refactor. Get it right upfront.
There are three options that actually matter for most Python RAG projects: FAISS, Chroma, and Pinecone. Each one occupies a different position on the simplicity-vs-scale spectrum. I'll show you how to build the same pipeline in all three, then give you the honest tradeoffs.
Understanding the three options
FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings. No server, no dependencies beyond the library itself, runs in-process. Everything lives in memory. It's the fastest option by far on a single machine — sub-millisecond queries at 100k vectors. The tradeoff: FAISS only stores vectors. You manage document text separately, and you manage persistence by serializing the index to disk yourself.
Chroma is a vector database that runs as a local embedded database (or self-hosted server). It handles persistence automatically, supports metadata storage alongside vectors, and has a clean Python API. Slightly slower than FAISS due to the database overhead, but vastly more ergonomic for anything beyond a prototype.
Pinecone is a fully managed cloud vector database. You don't run any infrastructure: create an index, upsert vectors, query. It scales horizontally, supports real-time updates at high concurrency, and has built-in metadata filtering. It costs money: roughly $70/month for a million vectors is a reasonable working estimate, but serverless billing is usage-based (storage plus read and write units), so the real figure depends on your query volume. Worth it in production; overkill for development.
Build the same RAG pipeline in all three
I'll use OpenAI's text-embedding-3-small for the embeddings (1536 dimensions, $0.02/1M tokens). The embedding model choice is independent of the vector store — see the embedding models comparison for help picking the right one.
FAISS
import json

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([r.embedding for r in response.data], dtype="float32")

# Build index
docs = ["doc1 text", "doc2 text", ...]  # your 100 documents
embeddings = embed(docs)
dimension = embeddings.shape[1]  # 1536 for text-embedding-3-small
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine similarity on normalized vectors
faiss.normalize_L2(embeddings)  # normalizes in place
index.add(embeddings)

# Store doc text alongside (FAISS only stores vectors, not metadata)
with open("doc_store.json", "w") as f:
    json.dump(docs, f)

# Query
query_embedding = embed(["What is RAG?"])
faiss.normalize_L2(query_embedding)
distances, indices = index.search(query_embedding, k=5)
# FAISS pads the result with -1 when the index holds fewer than k vectors
results = [docs[i] for i in indices[0] if i != -1]
Two things that trip people up here. First: IndexFlatIP does inner product similarity. After normalizing with faiss.normalize_L2(), inner product equals cosine similarity. If you skip normalization and still use IndexFlatIP, you get raw dot products instead, so scores scale with vector length and the ranking can differ from cosine. Second: FAISS stores vector IDs (integer indices) only, with no text and no metadata. The doc_store.json pattern is the standard workaround. In production you'd use a real database (Postgres, SQLite) keyed by the integer index.
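Here is a minimal sketch of the "real database keyed by the integer index" idea, using SQLite from the standard library. The table layout and the helper names (open_doc_store, add_docs, fetch_docs) are my own assumptions, not part of FAISS. It relies on the fact that a flat index assigns sequential IDs in insertion order, so start_id should be index.ntotal as read just before calling index.add().

```python
import sqlite3
from typing import Iterable, Optional

def open_doc_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a table mapping FAISS row ids to document text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, source TEXT)"
    )
    return conn

def add_docs(conn: sqlite3.Connection, start_id: int, texts: list[str],
             source: Optional[str] = None) -> None:
    """Insert texts under consecutive ids, mirroring the order given to index.add()."""
    rows = [(start_id + i, text, source) for i, text in enumerate(texts)]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO docs (id, text, source) VALUES (?, ?, ?)", rows)

def fetch_docs(conn: sqlite3.Connection, ids: Iterable[int]) -> list[str]:
    """Return texts in the order of ids, skipping FAISS's -1 padding."""
    texts = []
    for raw_id in ids:
        doc_id = int(raw_id)  # also accepts numpy integers from index.search()
        if doc_id < 0:
            continue
        row = conn.execute("SELECT text FROM docs WHERE id = ?", (doc_id,)).fetchone()
        if row is not None:
            texts.append(row[0])
    return texts

# Demo with hypothetical documents; in the pipeline above you would pass indices[0]
conn = open_doc_store()
add_docs(conn, start_id=0, texts=["alpha doc", "beta doc", "gamma doc"], source="manual.pdf")
print(fetch_docs(conn, [2, 0, -1]))
```

Keeping a source column here also gives you a place to do the metadata filtering that FAISS itself lacks, at the cost of filtering after retrieval rather than before it.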
Saving and loading the index:
faiss.write_index(index, "index.faiss")
# Later
index = faiss.read_index("index.faiss")
Without this, every restart rebuilds from scratch. Call write_index after every batch of additions (FAISS has no upsert; you add vectors with index.add).
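One caveat with saving in place: a crash partway through a write can leave a truncated file where the last good index used to be. A common way around that is to write to a temporary file and rename it over the target. The sketch below is a generic helper under that assumption; atomic_save and write_fn are names I made up, and the demo writes a plain text file so it runs without FAISS installed.

```python
import os
import tempfile
from typing import Callable

def atomic_save(path: str, write_fn: Callable[[str], None]) -> None:
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn reopens the file by path
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        # Leave the previous file untouched and clean up the partial write
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# With FAISS the call would look like this:
#   atomic_save("index.faiss", lambda p: faiss.write_index(index, p))

def _write_demo(p: str) -> None:
    with open(p, "w") as f:
        f.write("demo")

demo_path = os.path.join(tempfile.mkdtemp(), "index.demo")
atomic_save(demo_path, _write_demo)
```

If the writer raises, the old file stays as it was, which is the property you want from a persistence step that runs after every batch.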
For approximate nearest neighbor search at scale (>1M vectors), swap IndexFlatIP for IndexIVFFlat or IndexHNSWFlat. These trade a small amount of recall for dramatically faster queries. At 100k vectors, flat search is fine.
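A rough way to see where a flat index stops being comfortable is raw memory. The arithmetic below assumes float32 vectors at 1536 dimensions and counts only the vectors themselves, so treat it as a lower bound; flat_index_bytes is a hypothetical helper name.

```python
def flat_index_bytes(num_vectors: int, dimension: int = 1536, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw float32 vectors of a flat index."""
    return num_vectors * dimension * bytes_per_value

for n in (100_000, 1_000_000, 10_000_000):
    gib = flat_index_bytes(n) / 2**30
    print(f"{n:>10,} vectors: {gib:.2f} GiB")
```

That works out to roughly 0.57 GiB at 100k vectors, 5.72 GiB at 1M, and 57.22 GiB at 10M, and a flat search has to scan all of it on every query.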
Chroma
import chromadb
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

# Ingest
collection.add(
    documents=["doc1 text", "doc2 text", ...],
    ids=["doc1", "doc2", ...],
    metadatas=[{"source": "manual.pdf", "page": 1}, ...]
)

# Query
results = collection.query(
    query_texts=["What is RAG?"],
    n_results=5,
    where={"source": "manual.pdf"}  # metadata filtering
)
Notice the metadata={"hnsw:space": "cosine"} on collection creation — without this, Chroma defaults to L2 distance. For OpenAI embeddings, cosine is the right choice.
The PersistentClient writes to disk automatically at ./chroma_db. No explicit save calls. The where clause in query() is metadata filtering — you can pre-filter by source, date, document type, or any field you stored in metadatas. This is one of the main advantages over FAISS.
Chroma also accepts query_texts directly and calls the embedding function for you. Fewer lines of code than FAISS for the same pipeline.
Chroma in client-server mode (for multi-process access):
# Terminal: chroma run --path ./chroma_db --port 8000
client = chromadb.HttpClient(host="localhost", port=8000)
Same API, runs as a separate process. Good for FastAPI apps where multiple workers need to share the same index.
Running Chroma as a server on a Hostinger KVM 2 VPS (~₹700/month) gives you a persistent, network-accessible vector store without paying Pinecone's managed service fees. Works well up to ~500k vectors.
Pinecone
from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

pc = Pinecone(api_key="your-pinecone-key")
openai_client = OpenAI()

# Create index (one-time setup)
pc.create_index(
    name="docs",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("docs")

# Embed and ingest
def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

# docs is the same list of document strings used in the FAISS example
vectors = [
    {"id": f"doc{i}", "values": embed(doc), "metadata": {"text": doc, "source": "manual.pdf"}}
    for i, doc in enumerate(docs)
]
index.upsert(vectors=vectors, namespace="production")

# Query the same namespace the vectors were written to;
# omitting it searches the default namespace and finds nothing
query_embedding = embed("What is RAG?")
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    namespace="production",
    filter={"source": {"$eq": "manual.pdf"}},
)
Pinecone's filter syntax is MongoDB-style ($eq, $in, $gte). The namespace parameter lets you partition an index — useful for multi-tenant apps where you want to query only the vectors for a specific customer.
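If the MongoDB-style operators are new to you, it can help to see what they select. The function below is a local approximation I wrote for illustration only: it is not Pinecone's implementation, it covers just the three operators named above plus the plain-value shorthand for $eq, and the sample records are made up. Multiple fields in one filter are combined with AND.

```python
from typing import Any

def matches(metadata: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Return True if a record's metadata satisfies a MongoDB-style filter."""
    operators = {
        "$eq": lambda value, arg: value == arg,
        "$in": lambda value, arg: value in arg,
        "$gte": lambda value, arg: value is not None and value >= arg,
    }
    for field, condition in flt.items():
        value = metadata.get(field)
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # {"source": "x"} is shorthand for $eq
        for op, arg in condition.items():
            if not operators[op](value, arg):
                return False
    return True

# Hypothetical records, shaped like the metadata stored at upsert time
records = [
    {"source": "manual.pdf", "page": 3},
    {"source": "manual.pdf", "page": 12},
    {"source": "faq.md", "page": 1},
]
flt = {"source": {"$in": ["manual.pdf", "guide.pdf"]}, "page": {"$gte": 10}}
print([r for r in records if matches(r, flt)])
```

The same dictionary is what you would pass as filter= in index.query(); the server applies it for you, so this local version is only a way to reason about what a filter will return.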
One gotcha: Pinecone's serverless indexes have a warm-up period on first query. In production, send a dummy query on startup so the real first query isn't slow.
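A warm-up call can be as small as one throwaway query at startup. The sketch below assumes an index object with the same query() method used above; warm_up is a hypothetical helper, and the _FakeIndex class exists only so the example runs without a Pinecone account.

```python
import time

def warm_up(index, dimension: int = 1536, namespace: str = "production") -> float:
    """Send one throwaway query and return how long it took, in seconds."""
    start = time.perf_counter()
    # Any non-zero vector of the right dimension will do; the results are discarded
    index.query(vector=[0.1] * dimension, top_k=1, namespace=namespace)
    return time.perf_counter() - start

class _FakeIndex:
    """Stand-in for a Pinecone index that records the calls it receives."""
    def __init__(self) -> None:
        self.calls = []

    def query(self, **kwargs):
        self.calls.append(kwargs)
        return {"matches": []}

fake = _FakeIndex()
elapsed = warm_up(fake)
print(f"warm-up query took {elapsed:.4f}s")
```

In a web app you would call this once from whatever startup hook your framework provides, and logging the elapsed time gives you a cheap record of how slow the cold path actually is.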
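For batch ingestion, keep each upsert call small. Pinecone caps the size of a single request, and 100 vectors per call is a safe batch size for 1536-dimensional vectors with metadata. For 10k+ documents, split into batches:
batch_size = 100
for i in range(0, len(vectors), batch_size):
    index.upsert(vectors=vectors[i:i+batch_size], namespace="production")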
Performance comparison
| | FAISS | Chroma | Pinecone |
|---|---|---|---|
| Query latency (100k vectors) | <1ms | 5–20ms | 10–50ms |
| Query latency (10M vectors) | 50–200ms | Not recommended | 10–50ms |
| Setup time | 5 minutes | 10 minutes | 15 minutes |
| Monthly cost (1M vectors) | $0 | $0 (self-hosted) | ~$70 |
| Metadata filtering | No (manual) | Yes | Yes |
| Real-time updates | Yes | Yes | Yes |
| Persistence | Manual (save/load) | Automatic | Automatic |
| Horizontal scale | No | No | Yes |
FAISS wins on raw latency because it's in-process — no network round-trip, no database overhead. That advantage disappears at 10M vectors without approximate indexing, and it disappears entirely once you add network latency in a real app.
When to upgrade
From FAISS to Chroma: when you need metadata filtering and don't want to maintain a parallel document store, or when you're tired of manually calling write_index and occasionally losing changes on a crash.
From Chroma to Pinecone: when you hit ~500k vectors and query latency starts climbing, when you need multi-region availability, or when you're running multiple application instances that need concurrent write access to the same index.
Hybrid search
All three vector stores support dense vector search out of the box. For hybrid search (combining dense similarity with keyword/BM25 matching), the approaches differ.
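Pinecone supports hybrid search natively: you can pass both a dense vector and a sparse vector in a single query. The index must use the dotproduct metric for this, so the cosine index created above would need to be recreated. The blend is controlled by an alpha weight (0 = all sparse/BM25, 1 = all dense) that you apply client-side by scaling the two vectors before sending the query.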
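Here is a minimal sketch of the alpha weighting, assuming it is applied on the client by scaling both vectors before the query, which is how I understand Pinecone's documented pattern. The function name weight_by_alpha and the sample vectors are mine; the sparse vector uses the indices/values dictionary shape that Pinecone expects.

```python
def weight_by_alpha(dense: list[float], sparse: dict, alpha: float) -> tuple[list[float], dict]:
    """Scale dense values by alpha and sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    weighted_dense = [v * alpha for v in dense]
    weighted_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return weighted_dense, weighted_sparse

# Hypothetical toy vectors; real dense vectors would have 1536 values
dense_q = [0.2, 0.4]
sparse_q = {"indices": [10, 45], "values": [0.5, 1.0]}
d, s = weight_by_alpha(dense_q, sparse_q, alpha=0.75)

# The weighted pair then goes into a single query:
#   index.query(vector=d, sparse_vector=s, top_k=5, include_metadata=True)
```

With alpha=1 the sparse values all become zero and you are back to pure dense search; with alpha=0 the dense side drops out instead.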
For FAISS and Chroma, you combine them with rank_bm25:
from rank_bm25 import BM25Okapi

tokenized_docs = [doc.split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

def hybrid_search(query: str, k: int = 5) -> list[str]:
    # Dense retrieval (embed, index, and docs come from the FAISS example)
    query_emb = embed([query])
    faiss.normalize_L2(query_emb)
    _, dense_indices = index.search(query_emb, k * 2)

    # Sparse retrieval
    sparse_scores = bm25.get_scores(query.split())
    sparse_indices = np.argsort(sparse_scores)[::-1][:k * 2]

    # Reciprocal rank fusion, with ranks starting at 1
    scores = {}
    for rank, idx in enumerate(dense_indices[0], start=1):
        if idx == -1:  # FAISS padding when fewer than k * 2 vectors exist
            continue
        scores[int(idx)] = scores.get(int(idx), 0) + 1 / (rank + 60)
    for rank, idx in enumerate(sparse_indices, start=1):
        scores[int(idx)] = scores.get(int(idx), 0) + 1 / (rank + 60)

    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs[i] for i in top_k]
The 60 in the denominator is the standard RRF constant — it smooths out rank differences between the two systems.
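The fusion step does not depend on FAISS at all, so it can be pulled out into a helper that works on any ranked lists of IDs. This is a sketch under that assumption: rrf_fuse is a hypothetical name, and the document IDs are made up. With Chroma, the dense ranking would be the first list in the "ids" field of the collection.query() result.

```python
from collections import defaultdict
from typing import Hashable, Sequence

def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60, top_n: int = 5) -> list:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to an item's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical rankings from a dense retriever and a BM25 retriever
dense_ids = ["doc3", "doc1", "doc7"]
sparse_ids = ["doc1", "doc9", "doc3"]
print(rrf_fuse([dense_ids, sparse_ids], top_n=3))  # ['doc1', 'doc3', 'doc9']
```

doc1 comes out on top because it ranks well in both lists, even though neither retriever put doc3 below it by much; that agreement bonus is the whole point of fusing ranks instead of raw scores.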
The decision in one sentence each
Use FAISS if you want the simplest possible setup and you're comfortable managing persistence and metadata yourself.
Use Chroma if you want a real database with metadata filtering and don't want to run managed infrastructure.
Use Pinecone if you're in production, you need scale, and you'd rather pay for infrastructure than maintain it.
For most projects, start with Chroma — it's the right balance of simplicity and capability. Graduate to Pinecone when Chroma's limitations become real problems, not hypothetical ones. And if you're doing agentic RAG where the retrieval strategy itself is dynamic, the vector store choice matters less than how you're querying it.
For a full comparison of build-vs-buy decisions for LLM infrastructure, fine-tuning vs RAG vs prompting covers the higher-level tradeoffs. And if you're thinking about caching expensive query embeddings to reduce latency and API costs, semantic caching with Redis and GPTCache is worth reading next.



