What is prompt engineering?

Prompt engineering is the practice of crafting inputs to AI language models to produce accurate, useful, and reliable outputs. It involves choosing the right words, structure, context, and format to guide the AI toward the response you actually need — rather than a generic or off-target one.

Which AI models benefit most from better prompting?

All major large language models, including ChatGPT (GPT-4o), Claude, and Gemini, are sensitive to prompt quality. The same task can produce dramatically different results depending on how you structure your request, and better prompting improves output on every major model.

Do I need technical skills to do prompt engineering?

No. Prompt engineering is done in natural language — you write text instructions, not code. Basic prompting needs no technical background at all. Advanced techniques like prompt chaining or agentic workflows can benefit from light scripting knowledge, but the core skill is clear written communication.

Where can I learn more about prompt engineering?

MasterPrompting.net offers a structured curriculum from beginner to advanced, covering every major technique from basic clarity and context to chain-of-thought, meta-prompting, and agentic workflows. Start with the Beginner track to build a solid foundation.

Build a Vector Store for RAG — FAISS vs Chroma vs Pinecone (With Code)

The vector store is the infrastructure decision in a RAG system that's hardest to change after the fact. Switching embedding models is annoying. Switching vector stores after you've written all your ingestion code, metadata schemas, and query logic — that's a significant refactor. Get it right upfront.

There are three options that actually matter for most Python RAG projects: FAISS, Chroma, and Pinecone. Each one occupies a different position on the simplicity-vs-scale spectrum. I'll show you how to build the same pipeline in all three, then give you the honest tradeoffs.

Understanding the three options

FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings. There's no server and nothing to run beyond the library itself: it runs in-process and everything lives in memory. It's the fastest option on a single machine because queries skip the network round-trip and database layer entirely. The tradeoff: FAISS only stores vectors. You manage document text separately, and you manage persistence by serializing the index to disk yourself.
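Because a flat FAISS index keeps every float32 vector in RAM, you can estimate its footprint before you build it. A back-of-the-envelope sketch (the helper name is ours, and it ignores FAISS's small fixed overhead):

```python
def flat_index_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Approximate RAM for an uncompressed index: one float32 (4 bytes) per dimension per vector."""
    return num_vectors * dimension * bytes_per_value


if __name__ == "__main__":
    # 1536 dimensions matches text-embedding-3-small
    for n in (100_000, 1_000_000):
        print(f"{n:,} vectors: {flat_index_bytes(n, 1536) / 1e9:.2f} GB")
```

That works out to roughly 0.61 GB at 100k vectors and 6.14 GB at a million, which is worth checking against your server's RAM before committing to a flat index.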

Chroma is a vector database that runs embedded in your Python process (or as a self-hosted server). It handles persistence automatically, supports metadata storage alongside vectors, and has a clean Python API. It's slightly slower than FAISS because of the database layer, but far more ergonomic for anything beyond a prototype.


Pinecone is a fully managed cloud vector database. You don't run any infrastructure: create an index, upsert vectors, query. It scales horizontally, supports real-time updates at high concurrency, and has built-in metadata filtering. It costs money (budget around $70/month for a million vectors; serverless billing is usage-based, so the actual bill scales with storage, reads, and writes). Worth it in production; overkill for development.

Build the same RAG pipeline in all three

I'll use OpenAI's text-embedding-3-small for the embeddings (1536 dimensions, $0.02/1M tokens). The embedding model choice is independent of the vector store — see the embedding models comparison for help picking the right one.

FAISS

import faiss
import numpy as np
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([r.embedding for r in response.data], dtype="float32")

# Build index
docs = ["doc1 text", "doc2 text"]  # replace with your documents (e.g. 100 of them)
embeddings = embed(docs)

dimension = embeddings.shape[1]  # 1536 for text-embedding-3-small
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine similarity on normalized vectors
faiss.normalize_L2(embeddings)  # normalizes each row in place
index.add(embeddings)  # vectors get sequential ids 0, 1, 2, ...

# Store doc text alongside (FAISS only stores vectors, not metadata)
with open("doc_store.json", "w") as f:
    json.dump(docs, f)

# Query
query_embedding = embed(["What is RAG?"])
faiss.normalize_L2(query_embedding)
# For IndexFlatIP, "distances" are similarity scores (higher = closer)
distances, indices = index.search(query_embedding, k=5)
# FAISS pads with -1 when the index holds fewer than k vectors; skip those ids
results = [docs[i] for i in indices[0] if i != -1]

Two things trip people up here. First: IndexFlatIP computes inner product, and inner product equals cosine similarity only after normalizing with faiss.normalize_L2(). If you skip normalization with an embedding model that doesn't return unit-length vectors, IndexFlatIP ranks by raw dot product instead of cosine. (OpenAI's embeddings are already normalized to length 1, so here normalization is cheap insurance rather than a fix.) Second: FAISS stores only vectors, addressed by integer IDs, with no text and no metadata. The doc_store.json pattern is the standard workaround. In production you'd use a real database (Postgres, SQLite) keyed by the integer index.
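A minimal sketch of the SQLite variant, using only the standard library. The table name, schema, and helper functions are hypothetical; the one rule that matters is that each row's id equals the position FAISS assigned when the vector was added (0, 1, 2, ... for a flat index filled in order).

```python
import json
import sqlite3


def create_doc_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a table mapping FAISS integer ids to document text and metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT NOT NULL, metadata TEXT)"
    )
    return conn


def add_docs(conn: sqlite3.Connection, start_id: int, texts: list[str], metadatas: list[dict]) -> None:
    """Store docs under ids start_id, start_id + 1, ... in the same order they were added to FAISS."""
    rows = [
        (start_id + offset, text, json.dumps(meta))
        for offset, (text, meta) in enumerate(zip(texts, metadatas))
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO docs (id, text, metadata) VALUES (?, ?, ?)", rows)


def fetch_docs(conn: sqlite3.Connection, faiss_ids) -> list[dict]:
    """Resolve FAISS result ids in rank order, skipping -1 (FAISS's 'no result' padding)."""
    results = []
    for faiss_id in faiss_ids:
        if faiss_id == -1:
            continue
        row = conn.execute(
            "SELECT text, metadata FROM docs WHERE id = ?", (int(faiss_id),)
        ).fetchone()
        if row is not None:
            results.append({"id": int(faiss_id), "text": row[0], "metadata": json.loads(row[1])})
    return results


if __name__ == "__main__":
    conn = create_doc_store()
    add_docs(conn, 0, ["doc1 text", "doc2 text"], [{"source": "manual.pdf"}, {"source": "faq.md"}])
    # In the real pipeline these ids would come from indices[0] returned by index.search()
    print(fetch_docs(conn, [1, 0, -1]))
```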

Saving and loading the index:

faiss.write_index(index, "index.faiss")

# Later
index = faiss.read_index("index.faiss")

Without this, every restart rebuilds from scratch. Call write_index after every batch of index.add() calls.
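One way to avoid a half-written index file if the process dies mid-save is to write to a temporary file in the same directory and then swap it in with os.replace, which is an atomic rename on POSIX systems. The atomic_save helper below is a hypothetical sketch; in the FAISS pipeline you would pass lambda p: faiss.write_index(index, p) as the writer. It protects against a truncated file, not against the changes made since the last save.

```python
import os
import tempfile
from typing import Callable


def atomic_save(path: str, writer: Callable[[str], None]) -> None:
    """Call writer(tmp_path), then swap the finished file into place with os.replace().

    If writer() raises or the process dies before the swap, the file already at
    `path` (the last good index) is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace() is a same-filesystem rename
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # the writer reopens the path itself (faiss.write_index takes a filename)
    try:
        writer(tmp_path)
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # discard the partial file
        raise


if __name__ == "__main__":
    # Stand-in writer for the demo; with FAISS you would pass:
    #     lambda p: faiss.write_index(index, p)
    def write_demo_bytes(p: str) -> None:
        with open(p, "wb") as f:
            f.write(b"index bytes")

    with tempfile.TemporaryDirectory() as tmp_dir:
        target = os.path.join(tmp_dir, "index.faiss")
        atomic_save(target, write_demo_bytes)
        with open(target, "rb") as f:
            print(f.read())
```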

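For approximate nearest neighbor search at scale (>1M vectors), swap IndexFlatIP for IndexIVFFlat or IndexHNSWFlat. These trade a small amount of recall for dramatically faster queries. Two details matter when you switch: both default to L2 distance, so pass faiss.METRIC_INNER_PRODUCT explicitly to keep cosine scoring on normalized vectors (and use an IndexFlatIP quantizer for IVF); and IndexIVFFlat learns its cluster centroids from data, so call train() on a representative sample before add(). At 100k vectors, flat search is fine.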
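To see where that recall loss comes from, here is a toy, pure-Python sketch of the idea behind IndexIVFFlat: each vector is filed under its closest centroid, and a query scans only the nprobe closest buckets. It is an illustration, not FAISS's implementation: FAISS learns centroids with k-means during train(), whereas this sketch simply samples data points, and the class and function names are hypothetical.

```python
import random


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyIVF:
    """Toy inverted-file index: bucket vectors by closest centroid, scan only nprobe buckets."""

    def __init__(self, centroids: list[list[float]]):
        self.centroids = centroids
        # One inverted list per centroid, holding (vector_id, vector) pairs
        self.lists: list[list[tuple[int, list[float]]]] = [[] for _ in centroids]
        self.ntotal = 0

    def _closest_centroids(self, vec: list[float], n: int) -> list[int]:
        # Rank centroids by inner product, highest first (cosine if all vectors are unit-length)
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dot(vec, self.centroids[c]), reverse=True)
        return order[:n]

    def add(self, vectors: list[list[float]]) -> None:
        for vec in vectors:
            bucket = self._closest_centroids(vec, 1)[0]
            self.lists[bucket].append((self.ntotal, vec))  # sequential ids, as with FAISS add()
            self.ntotal += 1

    def search(self, query: list[float], k: int, nprobe: int) -> list[int]:
        # Only the nprobe best buckets are scanned; neighbors filed elsewhere are missed
        candidates = []
        for bucket in self._closest_centroids(query, nprobe):
            candidates.extend(self.lists[bucket])
        candidates.sort(key=lambda item: dot(query, item[1]), reverse=True)
        return [vec_id for vec_id, _ in candidates[:k]]


def brute_force(vectors: list[list[float]], query: list[float], k: int) -> list[int]:
    """Exact search over every vector, the IndexFlatIP equivalent."""
    order = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return order[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(2000)]
    ivf = ToyIVF(rng.sample(data, 32))  # shortcut: sampled points instead of k-means centroids
    ivf.add(data)

    queries = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(20)]
    for nprobe in (1, 4, 32):
        hits = sum(len(set(ivf.search(q, 10, nprobe)) & set(brute_force(data, q, 10)))
                   for q in queries)
        print(f"nprobe={nprobe:>2}: recall@10 = {hits / (10 * len(queries)):.2f}")
```

With nprobe equal to the number of buckets the search is exact again; FAISS exposes the same knob as the nprobe attribute on IVF indexes.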

Chroma

import chromadb
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

# Ingest (documents, ids, and metadatas must all be the same length)
collection.add(
    documents=["doc1 text", "doc2 text"],  # replace with your documents
    ids=["doc1", "doc2"],  # one unique id per document
    metadatas=[{"source": "manual.pdf", "page": 1}, {"source": "manual.pdf", "page": 2}]
)

# Query
results = collection.query(
    query_texts=["What is RAG?"],
    n_results=5,
    where={"source": "manual.pdf"}  # metadata filtering
)
print(results["documents"][0])  # results are nested: one list per query text

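Notice the metadata={"hnsw:space": "cosine"} on collection creation: without it, Chroma defaults to squared L2 distance. Because OpenAI embeddings are unit-length, L2 would actually return the same ranking, but cosine distance (1 minus cosine similarity) is easier to read and threshold, so cosine is still the natural choice here.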
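A quick standard-library check of why, for unit-length vectors, the metric choice is mostly about interpretability: after L2 normalization, inner product equals cosine similarity and squared L2 distance equals 2 - 2 * cosine, so all three rank neighbors identically. The helper names are ours; the random vectors stand in for real embeddings.

```python
import math
import random


def normalize(v: list[float]) -> list[float]:
    """Scale to unit length (what faiss.normalize_L2 does in place for each row)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.gauss(0, 1) for _ in range(16)]
    b = [rng.gauss(0, 1) for _ in range(16)]
    cos = cosine_similarity(a, b)
    ua, ub = normalize(a), normalize(b)
    print("cosine (raw vectors):      ", round(cos, 6))
    print("inner product (normalized):", round(inner_product(ua, ub), 6))
    print("squared L2 (normalized):   ", round(squared_l2(ua, ub), 6), "vs 2 - 2*cos:", round(2 - 2 * cos, 6))
    print("inner product (raw):       ", round(inner_product(a, b), 6), "(not a cosine score)")
```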

The PersistentClient writes to disk automatically at ./chroma_db. No explicit save calls. The where clause in query() is metadata filtering — you can pre-filter by source, date, document type, or any field you stored in metadatas. This is one of the main advantages over FAISS.

Chroma also accepts query_texts directly and calls the embedding function for you. Fewer lines of code than FAISS for the same pipeline.

Chroma in client-server mode (for multi-process access):

# Terminal: chroma run --path ./chroma_db --port 8000

client = chromadb.HttpClient(host="localhost", port=8000)

Same API, runs as a separate process. Good for FastAPI apps where multiple workers need to share the same index.

Running Chroma as a server on a Hostinger KVM 2 VPS (~₹700/month) gives you a persistent, network-accessible vector store without paying Pinecone's managed service fees. Works well up to ~500k vectors.

Pinecone

from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

pc = Pinecone(api_key="your-pinecone-key")
openai_client = OpenAI()

# Create index (one-time setup; skipped if the index already exists)
if "docs" not in pc.list_indexes().names():
    pc.create_index(
        name="docs",
        dimension=1536,  # must match the embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index("docs")

# Embed and ingest
def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

docs = ["doc1 text", "doc2 text"]  # replace with your documents
vectors = [
    {"id": f"doc{i}", "values": embed(doc), "metadata": {"text": doc, "source": "manual.pdf"}}
    for i, doc in enumerate(docs)
]
index.upsert(vectors=vectors, namespace="production")

# Query the same namespace you upserted into (omitting it searches the default namespace).
# Freshly upserted vectors can take a few seconds to become queryable.
query_embedding = embed("What is RAG?")
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="production",
    include_metadata=True,
    filter={"source": {"$eq": "manual.pdf"}}
)

Pinecone's filter syntax is MongoDB-style ($eq, $in, $gte). The namespace parameter lets you partition an index — useful for multi-tenant apps where you want to query only the vectors for a specific customer.
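To make the operator semantics concrete, here is a tiny hypothetical evaluator for a subset of the MongoDB-style syntax ($eq, $ne, $in, $nin, $gt, $gte, $lt, $lte, $and, $or). It only illustrates how such a filter matches a metadata dict; it is not Pinecone's or Chroma's code, and each service documents exactly which operators and value types it accepts. Treating a missing field as a non-match is this sketch's own choice.

```python
import operator
from typing import Any

# Comparison operators handled by this sketch (a subset of the MongoDB-style syntax)
_COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: dict[str, Any], filt: dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every clause in `filt` (top-level clauses are ANDed)."""
    for key, condition in filt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # missing field: treated as no match in this sketch
            # A bare value like {"source": "manual.pdf"} is shorthand for {"$eq": ...}
            ops = condition if isinstance(condition, dict) else {"$eq": condition}
            for op_name, target in ops.items():
                if not _COMPARISONS[op_name](metadata[key], target):
                    return False
    return True


if __name__ == "__main__":
    records = [
        {"id": "doc0", "source": "manual.pdf", "page": 1},
        {"id": "doc1", "source": "manual.pdf", "page": 7},
        {"id": "doc2", "source": "faq.md", "page": 2},
    ]
    filt = {"$and": [{"source": {"$eq": "manual.pdf"}}, {"page": {"$gte": 5}}]}
    print([r["id"] for r in records if matches(r, filt)])
```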

One gotcha: Pinecone's serverless indexes have a warm-up period on first query. In production, send a dummy query on startup so the real first query isn't slow.

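For batch ingestion, keep each upsert request small: Pinecone documents a 2 MB maximum request size and at most 1,000 records per upsert, so with 1536-dimension vectors plus text metadata, batches of 100 are a safe default. For 10k+ documents, split into batches: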

batch_size = 100
for i in range(0, len(vectors), batch_size):
    index.upsert(vectors=vectors[i:i+batch_size], namespace="production")
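If your records carry large text metadata, a fixed count of 100 can still produce oversized requests. A hypothetical helper that closes a batch at whichever limit comes first, a record count or an estimated byte size, could look like this. The size estimate (4 bytes per dimension plus the JSON-encoded id and metadata) is a rough approximation of the wire size, not Pinecone's own accounting, so the default keeps a 50% margin under an assumed 2 MB cap.

```python
import json
from typing import Any, Iterator

# Assumption: Pinecone's ~2 MB per-request cap; check the current limits for your plan
MAX_REQUEST_BYTES = 2_000_000


def estimate_record_bytes(record: dict[str, Any]) -> int:
    """Rough size: 4 bytes per float32 value plus the UTF-8 length of the id and JSON metadata."""
    metadata_json = json.dumps(record.get("metadata", {}))
    return 4 * len(record["values"]) + len(record["id"].encode("utf-8")) + len(metadata_json.encode("utf-8"))


def batch_records(
    records: list[dict[str, Any]],
    max_count: int = 100,
    max_bytes: int = MAX_REQUEST_BYTES // 2,  # 50% safety margin for the rough estimate
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive batches within both a record count and an estimated byte size.

    A single record larger than max_bytes is still yielded, alone in its own batch.
    """
    batch: list[dict[str, Any]] = []
    batch_bytes = 0
    for record in records:
        size = estimate_record_bytes(record)
        # Close the current batch if adding this record would break either limit
        if batch and (len(batch) >= max_count or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


if __name__ == "__main__":
    demo = [{"id": f"doc{i}", "values": [0.0] * 1536, "metadata": {"text": "y" * 20_000}}
            for i in range(100)]
    print([len(b) for b in batch_records(demo)])
```

In the Pinecone pipeline each yielded batch would go to index.upsert(vectors=batch, namespace="production").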

Performance comparison

	FAISS	Chroma	Pinecone
Query latency (100k vectors)	<1ms	5–20ms	10–50ms
Query latency (10M vectors)	50–200ms	Not recommended	10–50ms
Setup time	5 minutes	10 minutes	15 minutes
Monthly cost (1M vectors)	$0	$0 (self-hosted)	~$70
Metadata filtering	No (manual)	Yes	Yes
Real-time updates	Yes	Yes	Yes
Persistence	Manual (save/load)	Automatic	Automatic
Horizontal scale	No	No	Yes

FAISS wins on raw latency because it's in-process: no network round-trip, no database overhead. That advantage disappears at 10M vectors without approximate indexing, and in a real app it matters much less, since the request's own network hop and the LLM call usually dominate end-to-end latency.

When to upgrade

From FAISS to Chroma: when you need metadata filtering and don't want to maintain a parallel document store, or when you're tired of manually calling write_index and occasionally losing changes on a crash.

From Chroma to Pinecone: when you hit ~500k vectors and query latency starts climbing, when you need multi-region availability, or when you're running multiple application instances that need concurrent write access to the same index.

Hybrid search

All three vector stores support dense vector search out of the box. For hybrid search (combining dense similarity with keyword/BM25 matching), the approaches differ.

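Pinecone supports hybrid search natively on indexes created with metric="dotproduct" (not the cosine index above): you pass both a dense vector and a sparse vector in a single query. The blend is controlled by an alpha weight that you apply client-side, scaling the dense values by alpha and the sparse values by 1 - alpha (0 = all sparse/BM25, 1 = all dense).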
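A minimal sketch of that client-side weighting, modeled on the convex-combination approach Pinecone's hybrid-search guide describes. The function name is hypothetical, the sparse-vector dict shape ({"indices": [...], "values": [...]}) follows Pinecone's sparse format, and how you produce the sparse vector (BM25, SPLADE, etc.) is up to you.

```python
from typing import Any


def weight_hybrid_query(
    dense: list[float], sparse: dict[str, list[Any]], alpha: float
) -> tuple[list[float], dict[str, list[Any]]]:
    """Scale dense values by alpha and sparse values by (1 - alpha).

    alpha=1.0 -> pure dense (semantic), alpha=0.0 -> pure sparse (keyword/BM25).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    weighted_dense = [value * alpha for value in dense]
    weighted_sparse = {
        "indices": list(sparse["indices"]),  # term ids are unchanged
        "values": [value * (1 - alpha) for value in sparse["values"]],
    }
    return weighted_dense, weighted_sparse


if __name__ == "__main__":
    dense, sparse = weight_hybrid_query(
        [0.5, -0.5], {"indices": [3, 17], "values": [1.0, 2.0]}, alpha=0.75
    )
    print(dense, sparse)
    # These would then go into a single query against a dotproduct index, e.g.
    # index.query(vector=dense, sparse_vector=sparse, top_k=5, include_metadata=True)
```

Because a dotproduct index scores a record as the dense dot product plus the sparse dot product, scaling only the query vectors is enough to weight the two signals.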

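For FAISS and Chroma, you run BM25 yourself with rank_bm25 and fuse the two rankings. This version reuses docs, embed(), and the FAISS index from the FAISS section: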

from rank_bm25 import BM25Okapi

# Reuses docs, embed(), index, faiss, and np from the FAISS section
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

def hybrid_search(query: str, k: int = 5) -> list[str]:
    # Dense retrieval: top 2k candidates from FAISS
    query_emb = embed([query])
    faiss.normalize_L2(query_emb)
    _, dense_indices = index.search(query_emb, k * 2)
    dense_ranked = [int(i) for i in dense_indices[0] if i != -1]  # drop FAISS padding

    # Sparse retrieval: top 2k candidates by BM25 score
    sparse_scores = bm25.get_scores(query.lower().split())
    sparse_ranked = [int(i) for i in np.argsort(sparse_scores)[::-1][:k * 2]]

    # Reciprocal rank fusion: score = sum of 1 / (60 + rank), with rank starting at 1
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, idx in enumerate(ranked, start=1):
            scores[idx] = scores.get(idx, 0.0) + 1 / (60 + rank)

    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs[i] for i in top_k]

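The 60 is the constant k from the original RRF paper. Each document scores 1 / (60 + rank) in each list it appears in, so a large k flattens the gap between adjacent ranks: a document that both systems rank reasonably well can beat one that only a single system puts first.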
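Here is the fusion step on its own, as a dependency-free sketch that accepts best-first id lists from any number of retrievers, including the ids Chroma returns. It uses the usual 1-based rank form, score = sum of 1 / (k + rank); the function name is ours.

```python
from collections import defaultdict
from collections.abc import Hashable, Sequence
from typing import Optional


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60, top_n: Optional[int] = None
) -> list[Hashable]:
    """Fuse best-first ranked lists; each item scores sum(1 / (k + rank)) with rank starting at 1."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.__getitem__, reverse=True)
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    dense_ranking = ["a", "b", "c", "d"]  # e.g. FAISS or Chroma ids, best first
    sparse_ranking = ["c", "a", "e"]      # e.g. BM25 ids, best first
    print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))  # ['a', 'c', 'b', 'e', 'd']
```

Here a and c, which both retrievers returned, finish ahead of b, even though the dense retriever ranked b second.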

The decision in one sentence each

Use FAISS if you want the simplest possible setup and you're comfortable managing persistence and metadata yourself.

Use Chroma if you want a real database with metadata filtering and don't want to run managed infrastructure.

Use Pinecone if you're in production, you need scale, and you'd rather pay for infrastructure than maintain it.

For most projects, start with Chroma — it's the right balance of simplicity and capability. Graduate to Pinecone when Chroma's limitations become real problems, not hypothetical ones. And if you're doing agentic RAG where the retrieval strategy itself is dynamic, the vector store choice matters less than how you're querying it.

For the higher-level build-vs-buy decisions in LLM infrastructure, see the guide to fine-tuning vs RAG vs prompting. And if you're thinking about caching expensive query embeddings to reduce latency and API costs, semantic caching with Redis and GPTCache is worth reading next.

Build a Vector Store for RAG — FAISS vs Chroma vs Pinecone (With Code)

Understanding the three options

Build the same RAG pipeline in all three

FAISS

Chroma

Pinecone

Performance comparison

When to upgrade

Hybrid search

The decision in one sentence each

Related articles

Async Python for LLM Apps — Patterns That Actually Work in Production

Claude API vs OpenAI API — Developer Comparison Guide (2026)

Claude Vision API — Complete Guide to Image Analysis and Understanding
