Embedding Models in 2026 — Which One to Use for RAG and Semantic Search

Picking the wrong embedding model costs you either quality or money — and you'll feel it the moment you try to switch. Changing embedding models means re-embedding every document in your index. At 10 million docs, that's a full reprocessing job, plus a vector store migration. So let's get this right the first time.

Embedding models are the foundation of RAG, semantic search, and recommendation systems. They convert text into dense vectors — lists of floats — so similarity search can find the most relevant chunks for a given query. The quality of those vectors determines retrieval quality. Everything else downstream (the LLM response, the citations, the factual accuracy) depends on whether the right chunks surfaced in the first place.
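To make the retrieval step concrete, here is a minimal sketch of similarity search over a handful of vectors. The three-dimensional vectors and chunk names are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from one of the models discussed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_id, score) pairs for the k chunks most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
chunks = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this is the embedded user question

results = top_k(query, chunks, k=2)
print(results)  # best match first
```

A production system would hand this loop to a vector store, but the ranking logic is the same: whichever chunks score highest are what the LLM gets to see.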

The six embedding models worth evaluating in 2026

Here's who's competing in this space right now.

OpenAI text-embedding-3-small — $0.02/1M tokens, 1536 dimensions, 8191 token context. The safe default for most projects. Well-tested, genuinely good multilingual support, widely benchmarked. If you're not sure what to pick, start here. At $0.02/1M tokens you can embed 50 million tokens for a dollar.

OpenAI text-embedding-3-large — $0.13/1M tokens, 3072 dimensions, same 8191 token context. Better quality on harder retrieval tasks — longer documents, more ambiguous queries, domain-specific content. Six and a half times the cost. Worth evaluating if 3-small retrieval quality isn't good enough, but run the benchmark first.

Cohere embed-v4 — $0.10/1M tokens, 1024 dimensions, 128k token context.

The standout here is the 128k context window and native support for images and text in a single embedding. You can embed a page of text and an image together and query across both. That's genuinely useful for multimodal RAG — product catalogs, technical documentation with diagrams, anything where images carry meaning.

BGE-M3 — free (self-hosted), 1024 dimensions, 8192 token context. The best open-source embedding model on MTEB right now. Multiple retrieval modes: dense (standard vector similarity), sparse (keyword-weighted, like BM25), and multi-vector (ColBERT-style late interaction). Run locally with sentence-transformers. GPU recommended for production throughput; CPU works fine for smaller datasets.

nomic-embed-text — free via Ollama, 768 dimensions, 8192 context. Fastest local option by a wide margin. Great for development, prototyping, and privacy-sensitive applications where data can't leave the machine. Quality is lower than BGE-M3, but the developer experience is excellent: ollama pull nomic-embed-text and you're running.

Voyage AI voyage-3 — $0.06/1M tokens, 1024 dimensions. Strong on code and technical content. Anthropic's documentation points to Voyage AI as an embeddings provider. If your RAG index is full of code snippets, API documentation, or engineering content, voyage-3 is worth testing.

MTEB benchmark scores (2026 approximate)

MTEB (Massive Text Embedding Benchmark) is the standard for comparing embedding models across retrieval, clustering, classification, and more. These are approximate 2026 scores — treat them as directional, not authoritative, since your domain will differ.

Model	MTEB Avg	Retrieval	Cost/1M tokens
text-embedding-3-large	64.6	55.4	$0.13
Cohere embed-v4	64.5	55.9	$0.10
Voyage-3	63.8	55.2	$0.06
BGE-M3	63.2	54.3	Free
text-embedding-3-small	62.3	51.7	$0.02
nomic-embed-text	61.1	49.5	Free

The gap between top and bottom is about 3.5 points on average and 6.4 on retrieval. In practice, the retrieval gap translates to noticeably worse top-5 recall on hard queries, the kind where users search for something indirect and expect the system to figure it out. For FAQ chatbots over a small, well-structured knowledge base, the gap barely matters. For enterprise search over millions of mixed-format documents, it matters a lot.

Dimension tradeoffs

More dimensions means richer representation — but also more storage and slower queries.

768-dimensional vectors take half the RAM of 1536-dimensional ones. At 10 million documents, that's the difference between fitting the index in 30GB and needing 60GB. Chroma and FAISS are both in-memory by default. Pinecone bills partly by vector count and dimension. None of this matters at 50k docs. It starts mattering fast above 1M.
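The RAM figures can be reproduced with a few lines of arithmetic. This sketch assumes one float32 vector (4 bytes per value) per document and counts only the raw vectors; index structures and metadata add overhead on top, and the function name is just for illustration.

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# 10 million documents at the dimension counts of the models in this article.
sizes = {dims: index_size_gb(10_000_000, dims) for dims in (768, 1024, 1536, 3072)}

for dims, gb in sizes.items():
    print(f"{dims:>4} dims: {gb:.2f} GB")
```

If you chunk documents before embedding, replace the document count with the chunk count, which is usually several times larger.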

The accuracy improvement from more dimensions tends to flatten out. Going from 768 → 1536 dimensions is a meaningful jump. Going from 1536 → 3072 (text-embedding-3-large) gives you a smaller gain at double the storage and, for the OpenAI models, 6.5 times the API cost.

If you're building a vector store with Chroma or FAISS, pick the smallest dimensions that meet your quality bar. For Pinecone at scale, 1024-dimensional models (Cohere, BGE-M3, Voyage-3) hit a good sweet spot.

Context length matters more than people realize

Most tutorials embed 512-token chunks. At that chunk size, context length doesn't separate these models: you're nowhere near any limit.

But sometimes you want to embed larger units. Full pages. Long product descriptions. Entire support tickets. Cohere embed-v4's 128k context window lets you embed a full research paper or a long contract as a single vector. That dramatically simplifies chunking strategy — you don't need to split and then figure out how to re-merge results.

For most use cases, roughly 8k tokens (8192 for BGE-M3 and nomic-embed-text, 8191 for the OpenAI models) is plenty. Plan around 512-token chunks with some overlap, and you'll never hit the limit with any of these models.
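As a rough sketch of what "512-token chunks with some overlap" looks like in code, the helper below slides a fixed-size window over a list of tokens. The 64-token overlap is an assumption chosen for illustration, and the integer list stands in for token IDs from whatever tokenizer your embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for real token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.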

When to use each model

Default / just starting out: text-embedding-3-small. Cheap, fast, excellent multilingual support, works with every major vector database. You can always upgrade later if quality benchmarks show it's not enough.

Best quality, budget available: text-embedding-3-large or Cohere embed-v4. Run both on your actual data with a small evaluation set before committing. They're close on MTEB, but domain-specific benchmarks can flip the ranking.

Multimodal RAG (images + text in one index): Cohere embed-v4, no contest. Nothing else in this list supports images natively.

Privacy / on-premise / no API costs: BGE-M3. It's the best open-source option and runs on a single GPU. With sentence-transformers, it's straightforward to self-host.

Local development and testing: nomic-embed-text via Ollama. Fastest setup, no API key, no data leaving your machine.

Code-heavy technical content: Voyage-3. Test it against text-embedding-3-small on a sample of your actual code/doc content — the benchmark gap is real for this domain.
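Several of these recommendations come down to "test it on your own data." One simple way to do that is recall@k over a small labeled set, sketched below. It assumes each query has exactly one known-relevant chunk, and the query IDs, chunk IDs, and rankings are hypothetical placeholders for whatever your candidate models return.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose relevant chunk appears in the top-k results.

    retrieved: {query_id: ranked list of chunk_ids}
    relevant:  {query_id: the single chunk_id that should be found}
    """
    hits = sum(1 for qid, gold in relevant.items() if gold in retrieved.get(qid, [])[:k])
    return hits / len(relevant)

# Hypothetical rankings produced by two candidate models on the same 4 queries.
relevant = {"q1": "c7", "q2": "c2", "q3": "c9", "q4": "c4"}
model_a = {"q1": ["c7", "c1"], "q2": ["c3", "c2"], "q3": ["c9"], "q4": ["c5", "c6"]}
model_b = {"q1": ["c1", "c8"], "q2": ["c2"], "q3": ["c3", "c5"], "q4": ["c6"]}

scores = {"model_a": recall_at_k(model_a, relevant), "model_b": recall_at_k(model_b, relevant)}
print(scores)
```

Even 50 to 100 hand-labeled queries from real usage tend to be more informative for your domain than a public leaderboard score.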

Code examples

BGE-M3 with sentence-transformers:

from sentence_transformers import SentenceTransformer

# Downloads the model weights from Hugging Face on first use.
model = SentenceTransformer("BAAI/bge-m3")
embeddings = model.encode(
    ["What is RAG?", "Retrieval-augmented generation combines search with LLMs"],
    normalize_embeddings=True,  # unit-length vectors, so dot product equals cosine similarity
)
# embeddings.shape: (2, 1024)

Note the normalize_embeddings=True. Cosine similarity itself ignores vector length, but many vector stores score by raw dot product (inner product), and those scores only match cosine similarity when the vectors are unit length. For the sparse and multi-vector retrieval modes, the FlagEmbedding library provides a BGEM3FlagModel class, which uses a different query interface.
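Normalization matters most when an index scores vectors by raw dot product. The sketch below uses two toy 2-dimensional vectors (made up for illustration) to show that the dot product of unnormalized vectors depends on their length, while the dot product of unit-length vectors matches cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]

raw_dot = dot(a, b)                               # depends on vector length
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # length-independent
true_cosine = cosine(a, b)

print(raw_dot, unit_dot, true_cosine)
```

So if your vector store is configured for inner-product search, storing normalized vectors gives you cosine ranking without any extra work at query time.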

OpenAI text-embedding-3-small:

from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Your text here"],
)
embedding = response.data[0].embedding  # list of 1536 floats

For batch embedding, pass a list of strings to input. The API handles batching up to 2048 inputs per request. Don't call it one document at a time.
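One way to stay under the 2048-input limit is to slice the corpus before calling the API. In this sketch, embed_batch is a hypothetical callable that takes a list of strings and returns one vector per string; a fake version is used here so the example runs without network access or an API key.

```python
def batched(items, batch_size=2048):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, embed_batch, batch_size=2048):
    """Embed texts in batches; embed_batch is any callable: list[str] -> list[vector]."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Hypothetical stand-in for the API call so the sketch runs offline.
calls = []
def fake_embed_batch(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

texts = [f"doc {i}" for i in range(5000)]
vectors = embed_all(texts, fake_embed_batch)
print(calls, len(vectors))
```

With the OpenAI client from the example above, embed_batch could wrap client.embeddings.create(model="text-embedding-3-small", input=batch) and return the embedding of each item in response.data.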

Cost reality at scale

Let's make this concrete. 10 million documents, average 500 tokens each = 5 billion tokens total.

text-embedding-3-small: $100
Voyage-3: $300
Cohere embed-v4: $500
text-embedding-3-large: $650
BGE-M3: GPU time only — around $10-30 of cloud compute if you batch efficiently

The gap is large. For a one-time embedding job, it might not matter. For a system that re-embeds frequently (real-time updates, document revisions), the ongoing cost adds up.
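The API figures above follow directly from the per-token prices. A small calculator like this one (the function and dictionary names are just for illustration) makes it easy to plug in your own document count and average length.

```python
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "Voyage-3": 0.06,
    "Cohere embed-v4": 0.10,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(num_docs, avg_tokens_per_doc, price_per_million):
    """One-time API cost in dollars for embedding the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

costs = {
    model: round(embedding_cost(10_000_000, 500, price), 2)
    for model, price in PRICE_PER_MILLION_TOKENS.items()
}

for model, dollars in costs.items():
    print(f"{model}: ${dollars:,.2f}")
```

For a system that re-embeds regularly, multiply the result by the fraction of the corpus that changes per month to get a rough recurring cost.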

The one thing you can't do: mix models

You cannot store embeddings from different models in the same vector index and query across them. They live in different semantic spaces — a similarity score between a text-embedding-3-small vector and a BGE-M3 vector is meaningless.

If you switch models after you've built your index, you re-embed everything. There's no shortcut.
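A cheap safeguard is to record the model name alongside the index and refuse vectors from anything else. The class below is a toy in-memory sketch, not any vector store's actual API; most real stores let you attach similar metadata to a collection.

```python
class VectorIndex:
    """Toy in-memory index that remembers which model produced its vectors."""

    def __init__(self, model_name, dimensions):
        self.model_name = model_name
        self.dimensions = dimensions
        self.vectors = {}

    def add(self, doc_id, vector, model_name):
        # Matching dimensions are not enough: two models can share a size
        # and still produce incompatible vector spaces.
        if model_name != self.model_name:
            raise ValueError(f"index was built with {self.model_name}, got {model_name}")
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")
        self.vectors[doc_id] = vector

index = VectorIndex("BAAI/bge-m3", dimensions=4)  # 4 dims only to keep the toy small
index.add("doc-1", [0.1, 0.2, 0.3, 0.4], model_name="BAAI/bge-m3")

try:
    index.add("doc-2", [0.1, 0.2, 0.3, 0.4], model_name="text-embedding-3-small")
    rejected = False
except ValueError:
    rejected = True

print(rejected, list(index.vectors))
```

The check on the model name matters because a dimension check alone would let through vectors from a different model that happens to use the same size.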

This makes the initial model choice load-bearing. If you're building a RAG pipeline for production, pick based on your actual requirements: What's your privacy posture? What's your expected document count in 12 months? Do you have images? Is the content highly technical?

For most teams: start with text-embedding-3-small, evaluate on a real sample of your data, and only upgrade if the quality metrics show you need to. The MTEB rankings are a starting point, not a verdict. Your domain is different from the benchmark domains.

If you're already running RAG and wondering why retrieval quality is inconsistent, check agentic RAG patterns — sometimes the issue isn't the embedding model, it's the retrieval strategy. And if you're deciding between building a RAG system vs. fine-tuning vs. just prompting, this comparison breaks down the tradeoffs.

Related articles

Build a Vector Store for RAG — FAISS vs Chroma vs Pinecone (With Code)

Instructor Library — The Best Way to Get Structured Outputs from Any LLM

LlamaIndex vs LangChain for RAG in 2026 — A Code-First Comparison
