Picking the wrong embedding model costs you either quality or money — and you'll feel it the moment you try to switch. Changing embedding models means re-embedding every document in your index. At 10 million docs, that's a full reprocessing job, plus a vector store migration. So let's get this right the first time.
Embedding models are the foundation of RAG, semantic search, and recommendation systems. They convert text into dense vectors — lists of floats — so similarity search can find the most relevant chunks for a given query. The quality of those vectors determines retrieval quality. Everything else downstream (the LLM response, the citations, the factual accuracy) depends on whether the right chunks surfaced in the first place.
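To make the similarity-search step concrete, here's a minimal sketch in plain Python. The three-dimensional vectors and chunk names are invented for illustration; in a real system the vectors come from one of the models below and live in a vector store, not a dict.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the k most similar."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (assumption): real models emit hundreds or thousands of dimensions.
chunks = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # stands in for the embedded user question

results = top_k(query, chunks, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

The chunks returned here are what the LLM gets to see, which is why a weaker embedding model caps the quality of everything downstream.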
The six embedding models worth evaluating in 2026
Here's who's competing in this space right now.
OpenAI text-embedding-3-small — $0.02/1M tokens, 1536 dimensions, 8191 token context. The safe default for most projects. Well-tested, genuinely good multilingual support, widely benchmarked. If you're not sure what to pick, start here. At $0.02/1M tokens you can embed 50 million tokens for a dollar.
OpenAI text-embedding-3-large — $0.13/1M tokens, 3072 dimensions, same 8191 token context. Better quality on harder retrieval tasks — longer documents, more ambiguous queries, domain-specific content. Six and a half times the cost. Worth evaluating if 3-small retrieval quality isn't good enough, but run the benchmark first.
Cohere embed-v4 — $0.10/1M tokens, 1024 dimensions, 128k token context.
The standout here is the 128k context window and native support for images and text in a single embedding. You can embed a page of text and an image together and query across both. That's genuinely useful for multimodal RAG — product catalogs, technical documentation with diagrams, anything where images carry meaning.
BGE-M3 — free (self-hosted), 1024 dimensions, 8192 token context. The best open-source embedding model on MTEB right now. Multiple retrieval modes: dense (standard vector similarity), sparse (keyword-weighted, like BM25), and multi-vector (ColBERT-style late interaction). Run locally with sentence-transformers. GPU recommended for production throughput; CPU works fine for smaller datasets.
nomic-embed-text — free via Ollama, 768 dimensions, 8192 context. Fastest local option by a wide margin. Great for development, prototyping, and privacy-sensitive applications where data can't leave the machine. Quality is lower than BGE-M3, but the developer experience is excellent: ollama pull nomic-embed-text and you're running.
Voyage AI voyage-3 — $0.06/1M tokens, 1024 dimensions. Strong on code and technical content. Anthropic invested in Voyage AI, and they use it internally. If your RAG index is full of code snippets, API documentation, or engineering content, voyage-3 is worth testing.
MTEB benchmark scores (2026 approximate)
MTEB (Massive Text Embedding Benchmark) is the standard for comparing embedding models across retrieval, clustering, classification, and more. These are approximate 2026 scores — treat them as directional, not authoritative, since your domain will differ.
| Model | MTEB Avg | Retrieval | Cost/1M tokens |
|---|---|---|---|
| text-embedding-3-large | 64.6 | 55.4 | $0.13 |
| Cohere embed-v4 | 64.5 | 55.9 | $0.10 |
| Voyage-3 | 63.8 | 55.2 | $0.06 |
| BGE-M3 | 63.2 | 54.3 | Free |
| text-embedding-3-small | 62.3 | 51.7 | $0.02 |
| nomic-embed-text | 61.1 | 49.5 | Free |
The gap between top and bottom is about 3.5 points on average and 6.4 on retrieval. In practice, the retrieval gap translates to noticeably worse top-5 recall on hard queries, the kind where users search for something indirect and expect the system to figure it out. For FAQ chatbots over a small, well-structured knowledge base, the gap barely matters. For enterprise search over millions of mixed-format documents, it matters a lot.
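Since MTEB numbers are only directional, it's worth measuring top-k recall on your own data. Here's a rough sketch of how that could look. It assumes one common definition of recall@k (the share of each query's relevant chunks that appear in the top k, averaged over queries), and the query IDs, chunk IDs, and rankings are made-up placeholders for whatever your retriever returns.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 5) -> float:
    """Average, over queries, of the fraction of relevant chunks found in the top k."""
    per_query = []
    for query_id, relevant_ids in relevant.items():
        top = set(retrieved.get(query_id, [])[:k])
        per_query.append(len(top & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query)

# Hand-labeled ground truth (hypothetical): which chunks answer each query.
relevant = {"q1": {"c1", "c4"}, "q2": {"c7"}}

# Ranked results from two candidate embedding models (hypothetical).
model_a = {"q1": ["c1", "c4", "c9"], "q2": ["c7", "c2", "c3"]}
model_b = {"q1": ["c1", "c9", "c4"], "q2": ["c2", "c7", "c3"]}

print("model A recall@2:", recall_at_k(model_a, relevant, k=2))
print("model B recall@2:", recall_at_k(model_b, relevant, k=2))
```

Even a few dozen labeled queries drawn from real user traffic will tell you more about your domain than a leaderboard average.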
Dimension tradeoffs
More dimensions means richer representation — but also more storage and slower queries.
768-dimensional vectors take half the RAM of 1536-dimensional ones. At 10 million documents, that's the difference between fitting the index in 30GB and needing 60GB. Chroma and FAISS are both in-memory by default. Pinecone bills partly by vector count and dimension. None of this matters at 50k docs. It starts mattering fast above 1M.
The accuracy improvement from more dimensions tends to flatten out. Going from 768 → 1536 dimensions is a meaningful jump. Going from 1536 → 3072 (text-embedding-3-large) gives you a smaller gain at double the storage and 6.5 times the API cost.
If you're building a vector store with Chroma or FAISS, pick the smallest dimensions that meet your quality bar. For Pinecone at scale, 1024-dimensional models (Cohere, BGE-M3, Voyage-3) hit a good sweet spot.
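The RAM figures above are simple arithmetic, sketched below. The assumptions: one vector per document, float32 values (4 bytes each), and raw vector storage only, so index overhead and metadata come on top. The function name is just for illustration.

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10**9 bytes), excluding index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value / 1e9

for dims in (768, 1024, 1536, 3072):
    print(f"{dims:>4} dims: {index_size_gb(10_000_000, dims):.1f} GB")
```

If you chunk documents before embedding, plug in the chunk count rather than the document count, since every chunk gets its own vector.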
Context length matters more than people realize
Most tutorials embed 512-token chunks. At that chunk size, context length doesn't separate these models: you're nowhere near any of their limits.
But sometimes you want to embed larger units. Full pages. Long product descriptions. Entire support tickets. Cohere embed-v4's 128k context window lets you embed a full research paper or a long contract as a single vector. That dramatically simplifies chunking strategy — you don't need to split and then figure out how to re-merge results.
For most use cases, 8192 tokens (BGE-M3, nomic-embed-text, OpenAI models) is plenty. Plan around 512-token chunks with some overlap, and you'll never hit the limit with any of these models.
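For reference, here's one way the "512-token chunks with some overlap" approach could be sketched. The 64-token overlap is an arbitrary illustrative choice, and the placeholder word list stands in for a real tokenizer, whose counts will differ from simple word counts.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder "tokens" (assumption): swap in your model's tokenizer for real counts.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=512, overlap=64)
print([len(chunk) for chunk in chunks])
```

The overlap keeps a sentence that straddles a boundary fully present in at least one chunk, at the cost of embedding a few tokens twice.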
When to use each model
Default / just starting out: text-embedding-3-small. Cheap, fast, excellent multilingual support, works with every major vector database. You can upgrade later if quality benchmarks show it's not enough, as long as you budget for re-embedding the index when you do.
Best quality, budget available: text-embedding-3-large or Cohere embed-v4. Run both on your actual data with a small evaluation set before committing. They're close on MTEB, but domain-specific benchmarks can flip the ranking.
Multimodal RAG (images + text in one index): Cohere embed-v4, no contest. Nothing else in this list supports images natively.
Privacy / on-premise / no API costs: BGE-M3. It's the best open-source option and runs on a single GPU. With sentence-transformers, it's straightforward to self-host.
Local development and testing: nomic-embed-text via Ollama. Fastest setup, no API key, no data leaving your machine.
Code-heavy technical content: Voyage-3. Test it against text-embedding-3-small on a sample of your actual code/doc content — the benchmark gap is real for this domain.
Code examples
BGE-M3 with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
embeddings = model.encode(
    ["What is RAG?", "Retrieval-augmented generation combines search with LLMs"],
    normalize_embeddings=True,  # unit-length vectors, so dot product equals cosine similarity
)
# embeddings.shape: (2, 1024)
```
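Note the normalize_embeddings=True. Cosine similarity itself ignores vector length, but many vector stores score with a raw dot product (inner product), and that only matches cosine similarity when the vectors are unit length. The sparse and multi-vector retrieval modes aren't available through this interface; for those, the FlagEmbedding package provides a BGEM3FlagModel class, which uses a different query interface.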
OpenAI text-embedding-3-small:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Your text here"],
)
embedding = response.data[0].embedding  # list of 1536 floats
```
For batch embedding, pass a list of strings to input. The API handles batching up to 2048 inputs per request. Don't call it one document at a time.
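A batching helper could look something like the sketch below. The names batched, embed_all, and fake_embed_batch are hypothetical, and the stand-in embedder exists only so the example runs without an API key; with the real client you'd pass a function that wraps client.embeddings.create.

```python
from typing import Callable, Iterator

MAX_INPUTS_PER_REQUEST = 2048  # per-request input limit mentioned above

def batched(texts: list[str], batch_size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def embed_all(texts: list[str], embed_batch: Callable[[list[str]], list[list[float]]]) -> list[list[float]]:
    """Embed every text with one call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts):
        vectors.extend(embed_batch(batch))
    return vectors

def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    """Stand-in embedder (assumption): returns a 1-dimensional 'vector' per text."""
    return [[float(len(text))] for text in batch]

docs = [f"document {i}" for i in range(5000)]
vectors = embed_all(docs, fake_embed_batch)
print(len(vectors), [len(batch) for batch in batched(docs)])
```

Here 5,000 documents go out in three requests instead of 5,000. A production version would also want retries and rate-limit handling, which are left out to keep the sketch short.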
Cost reality at scale
Let's make this concrete. 10 million documents, average 500 tokens each = 5 billion tokens total.
- text-embedding-3-small: $100
- Voyage-3: $300
- Cohere embed-v4: $500
- text-embedding-3-large: $650
- BGE-M3: GPU time only — around $10-30 of cloud compute if you batch efficiently
The gap is large. For a one-time embedding job, it might not matter. For a system that re-embeds frequently (real-time updates, document revisions), the ongoing cost adds up.
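If you want to rerun these numbers for your own corpus, the arithmetic is a one-liner. This sketch uses the per-million-token prices quoted earlier in this article; treat them as a snapshot and check current pricing before budgeting.

```python
# Prices in USD per 1M tokens, as quoted in this article (may be out of date).
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "voyage-3": 0.06,
    "cohere-embed-v4": 0.10,
    "text-embedding-3-large": 0.13,
}

def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int, price_per_million: float) -> float:
    """One-time API cost of embedding the whole corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

for model, price in PRICE_PER_MILLION_TOKENS.items():
    cost = embedding_cost_usd(10_000_000, 500, price)
    print(f"{model}: ${cost:,.0f}")
```

For a system that re-embeds on every document revision, multiply by how many times a year the corpus effectively turns over.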
The one thing you can't do: mix models
You cannot store embeddings from different models in the same vector index and query across them. They live in different semantic spaces — a similarity score between a text-embedding-3-small vector and a BGE-M3 vector is meaningless.
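One cheap safeguard is to record the model name alongside the index and refuse vectors that come from anything else. The VectorIndex class below is a hypothetical in-memory stand-in, not any particular vector database's API; most real stores let you keep this kind of tag in collection metadata.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model produced its vectors."""
    model_name: str
    dimensions: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, doc_id: str, vector: list[float], model_name: str) -> None:
        """Store a vector only if it came from the index's own embedding model."""
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got a vector from {model_name!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")
        self.vectors[doc_id] = vector

# Toy 3-dimensional index so the example stays readable.
index = VectorIndex(model_name="text-embedding-3-small", dimensions=3)
index.add("doc-1", [0.1, 0.2, 0.3], model_name="text-embedding-3-small")

try:
    index.add("doc-2", [0.3, 0.2, 0.1], model_name="BAAI/bge-m3")
except ValueError as err:
    print("rejected:", err)
```

Checking dimensions alone isn't enough: BGE-M3 and Voyage-3 both produce 1024-dimensional vectors, so a mixed index would load without errors and quietly return nonsense.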
If you switch models after you've built your index, you re-embed everything. There's no shortcut.
This makes the initial model choice load-bearing. If you're building a RAG pipeline for production, pick based on your actual requirements: What's your privacy posture? What's your expected document count in 12 months? Do you have images? Is the content highly technical?
For most teams: start with text-embedding-3-small, evaluate on a real sample of your data, and only upgrade if the quality metrics show you need to. The MTEB rankings are a starting point, not a verdict. Your domain is different from the benchmark domains.
If you're already running RAG and wondering why retrieval quality is inconsistent, check agentic RAG patterns — sometimes the issue isn't the embedding model, it's the retrieval strategy. And if you're deciding between building a RAG system vs. fine-tuning vs. just prompting, this comparison breaks down the tradeoffs.



