Most teams pick an embedding model the same way they pick a sorting algorithm: look up the benchmark, take the top result, ship it. For sorting algorithms that works fine. For embeddings it reliably produces a retrieval system that looks great on paper and underperforms on the actual product.
The MTEB leaderboard is not useless. But it is measuring retrieval on academic corpora with clean, well-formed queries against documents that look nothing like your internal docs, your customer support tickets, or your codebase. Ranking third on MTEB while being the worst model for your domain is entirely possible, and it happens constantly.
Here is what actually determines embedding model quality for a specific application, and how to measure it before you build on top of the wrong foundation.
What MTEB Is Actually Measuring
MTEB (Massive Text Embedding Benchmark) aggregates scores across 56 English datasets spanning several task types: retrieval, classification, clustering, semantic similarity, and others. The headline retrieval score comes from datasets like MSMARCO (Bing query logs against web documents), NFCorpus (nutrition science), and TREC-COVID (biomedical literature).
These are genuinely hard benchmarks. A model that ranks high here is a good general-purpose embedding model. But “good at retrieving Wikipedia paragraphs for web search queries” and “good at retrieving your company’s runbook sections for Slack queries from engineers” are different problems.
Three reasons the leaderboard misleads in practice:
1. Vocabulary mismatch. MTEB datasets are heavily English, heavily web text. If your corpus uses domain-specific abbreviations, product names, internal jargon, or non-English content, the top MTEB models were not trained to represent those tokens well. A model that learned “MSMARCO-style retrieval” does not know what your product’s internal naming conventions mean.
2. Query distribution mismatch. MTEB retrieval queries tend to be natural-language questions: “What is the capital of France?” or “How does kidney disease progress?” Your users may type product IDs, error codes, short fragments, or code snippets. The embedding space that works for question-answering may not cluster your actual queries near the right documents.
3. Task mix inflation. A model that is excellent at classification but mediocre at retrieval can score well overall on MTEB. For a RAG application, only the retrieval subset of MTEB matters. Look at BEIR scores specifically, not the overall MTEB rank.
The Model Landscape in 2026
Before getting into evaluation methodology, here is the current market:
| Model | Dimensions | MTEB Retrieval | Cost | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 256-3072 | 55.4 | $0.13 / 1M tokens | Matryoshka, truncatable |
| OpenAI text-embedding-3-small | 512-1536 | 51.7 | $0.02 / 1M tokens | Best cost-to-quality ratio for most cases |
| Cohere embed-v4 | 1024 | 56.1 | $0.10 / 1M tokens | Strong multilingual, images |
| Voyage AI voyage-3 | 1024 | 57.3 | $0.12 / 1M tokens | Best on code, technical text |
| Voyage AI voyage-3-lite | 512 | 54.2 | $0.02 / 1M tokens | Fast, cheap, solid |
| BGE-en-v2.0 (open source) | 1024 | 55.8 | Free (self-hosted) | Best open source retrieval |
| Nomic Embed v2 (open source) | 768 | 52.1 | Free (self-hosted) | 8192-token context, long docs |
| GTE-Qwen2-7B (open source) | 3584 | 58.1 | Free (self-hosted, expensive) | Top open source, needs GPU |
The spread between first and last on MTEB retrieval here is about 6 points. On your data, both the size of that gap and the ordering of the models can change.
Why Domain-Specific Corpora Flip the Rankings
Here is a concrete example. A team building a code search tool for a large Python codebase ran a proper evaluation. They took 500 real developer queries from their logs (things like “find the function that handles JWT expiry” or “where is rate limiting implemented for the API gateway”) and a set of ground-truth relevant code chunks.
Their results:
| Model | MTEB Retrieval Rank | Recall@5 on Internal Code |
|---|---|---|
| GTE-Qwen2-7B | 1 (best) | 0.61 |
| Voyage voyage-3 | 3 | 0.74 |
| BGE-en-v2.0 | 5 | 0.71 |
| OpenAI text-embedding-3-large | 6 | 0.68 |
| Fine-tuned BGE-en-v2.0 | n/a | 0.87 |
Voyage voyage-3 was trained specifically on code data, and that shows up in the results. The top MTEB model came last on actual retrieval. The fine-tuned open-source model beat every off-the-shelf model by a wide margin.
This pattern repeats across domains. Legal text, medical records, internal wikis, support tickets - any corpus with specific vocabulary and query patterns will diverge from the MTEB ordering.
How to Build Your Own Eval Set
You need 100-300 query-document pairs from your actual domain. This is the single most important investment you can make before picking an embedding model. The process:
Step 1: Mine your logs. If you have an existing search system, pull 200-300 real queries that got a click or positive feedback. Those represent what your users actually ask. If you are building from scratch, have domain experts write 50-100 representative queries.
Step 2: Label relevant documents. For each query, identify the 1-5 documents in your corpus that best answer it. This does not need to be exhaustive. You just need to know “for this query, these chunks are relevant.”
Step 3: Run retrieval and compute recall. For each query, embed it and retrieve top-k from your vector store. Recall@5 means: of the documents you labeled as relevant, what fraction appear in the top 5 results?
    from typing import Any, Dict, List

    import numpy as np
    from sentence_transformers import SentenceTransformer


    def evaluate_embedding_model(
        model_name: str,
        queries: List[str],
        relevant_docs: List[List[int]],  # per query: indices into corpus
        corpus: List[str],
        k: int = 5,
    ) -> Dict[str, Any]:
        model = SentenceTransformer(model_name)
        corpus_embeddings = model.encode(corpus, normalize_embeddings=True, batch_size=64)
        query_embeddings = model.encode(queries, normalize_embeddings=True)

        recalls = []
        mrr_scores = []
        for i, q_emb in enumerate(query_embeddings):
            # Cosine similarity (normalized vectors, so dot product = cosine)
            scores = corpus_embeddings @ q_emb
            top_k_indices = np.argsort(scores)[::-1][:k].tolist()

            relevant = set(relevant_docs[i])
            retrieved = set(top_k_indices)
            recall_at_k = len(relevant & retrieved) / len(relevant) if relevant else 0.0
            recalls.append(recall_at_k)

            # MRR@k: reciprocal rank of the first relevant document within the top k
            for rank, idx in enumerate(top_k_indices, start=1):
                if idx in relevant:
                    mrr_scores.append(1.0 / rank)
                    break
            else:
                # for/else: runs only when no relevant document was found in the top k
                mrr_scores.append(0.0)

        return {
            "model": model_name,
            "recall_at_k": float(np.mean(recalls)),
            "mrr": float(np.mean(mrr_scores)),
            "k": k,
        }


    # Compare models head-to-head on your eval set.
    # SentenceTransformer loads models from the Hugging Face Hub (verify the exact IDs there).
    # Hosted models must be embedded through the vendor's own API client instead.
    models_to_test = [
        "BAAI/bge-en-icl",
        "nomic-ai/nomic-embed-text-v2",
        # "voyageai/voyage-3",  # hosted: embed via their API client, not SentenceTransformer
    ]

    for model_name in models_to_test:
        result = evaluate_embedding_model(model_name, your_queries, your_relevant_docs, your_corpus)
        print(
            f"{result['model']}: recall@{result['k']}={result['recall_at_k']:.3f}, "
            f"MRR={result['mrr']:.3f}"
        )
This takes maybe 2 hours to set up and run. It will tell you more about which model to use than any amount of reading benchmark reports.
What to measure beyond recall: If your application surfaces a ranked list, also compute NDCG@10 (normalized discounted cumulative gain), which weights correct results higher when they appear earlier. If you only care about “did the right document appear at all,” recall@k is enough.
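If you do want NDCG@10, it is a small addition to the evaluation loop. The sketch below assumes binary relevance (a chunk is either labeled relevant or it is not) and that each document ID appears at most once in the ranked list; the function name `ndcg_at_k` is illustrative, not part of any library.

```python
import math
from typing import List, Set


def ndcg_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int = 10) -> float:
    """NDCG@k with binary relevance: 1 if a doc is labeled relevant, else 0."""
    # DCG: each relevant hit is discounted by log2(rank + 1), with rank starting at 1
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal DCG: every relevant doc packed into the top positions
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Same single relevant doc, found at rank 1 versus rank 2
print(ndcg_at_k([7, 3, 9], {7}))
print(ndcg_at_k([3, 7, 9], {7}))
```

Recall@5 scores both rankings identically; NDCG rewards the first one for putting the relevant chunk on top.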
The Dimensions vs Cost Tradeoff
Embedding dimension is a direct multiplier on storage, memory, and ANN (approximate nearest neighbor) search latency. A 3072-dimension embedding is 3x larger than a 1024-dimension one and roughly 2-3x slower to search at scale.
For most production systems, the dimension question comes down to this table:
| Dimension range | Storage (1M docs, 2 bytes per dimension) | Query latency (pgvector, 1M rows) | When to use |
|---|---|---|---|
| 256-512 | 0.5-1 GB | Under 50ms | High query volume, latency-sensitive |
| 768-1024 | 1.5-2 GB | 50-100ms | General use, best balance |
| 1536-2048 | 3-4 GB | 100-200ms | High-value low-volume retrieval |
| 3072+ | 6+ GB | 200-400ms | Rarely justified outside benchmarks |
The move to higher dimensions has diminishing returns. Going from 768 to 3072 dimensions typically improves recall by 2-4% on in-domain data. Going from a mismatched model to a domain-matched one at 768 dimensions can improve recall by 15-30%.
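The storage column in the table above is simple arithmetic, and it is worth redoing for your own numbers. The figures there line up with 2 bytes per dimension (half precision); plain float32 vectors take twice as much. A rough sketch, which counts raw vector bytes only and ignores index and row overhead (the helper name `index_size_gb` is illustrative):

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index and row overhead."""
    return num_vectors * dims * bytes_per_dim / 1e9


for dims in (256, 768, 1536, 3072):
    fp32 = index_size_gb(1_000_000, dims, bytes_per_dim=4)
    fp16 = index_size_gb(1_000_000, dims, bytes_per_dim=2)
    print(f"{dims:>5} dims: {fp32:.2f} GB float32, {fp16:.2f} GB float16")
```

An ANN index such as HNSW adds its own graph structure on top, so treat these as lower bounds.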
Matryoshka Embeddings - Truncate Without Losing Much
OpenAI’s text-embedding-3 models and several open-source models (BGE, Nomic) support Matryoshka representation learning (MRL). The idea: during training, the loss is computed on both the full embedding and truncated prefixes of it. As a result, the first N dimensions of the embedding already encode the most important semantic information, and you can truncate to a smaller size without retraining.
    from openai import OpenAI

    client = OpenAI()

    # Full 3072-dim embedding
    full_response = client.embeddings.create(
        model="text-embedding-3-large",
        input="your text here",
    )
    full_embedding = full_response.data[0].embedding  # 3072 floats

    # Truncated to 256 dims - still useful, much cheaper to store/search
    truncated_response = client.embeddings.create(
        model="text-embedding-3-large",
        input="your text here",
        dimensions=256,  # model applies MRL truncation internally
    )
    small_embedding = truncated_response.data[0].embedding  # 256 floats
For most RAG applications on generic English content, text-embedding-3-large at 512-768 dimensions is a pragmatic sweet spot. You get better quality than text-embedding-3-small and far lower storage cost than the full 3072.
For open source, BGE-M3 and Nomic Embed v2 both support MRL, so you can run at 512 dimensions on CPU-constrained infrastructure without sacrificing much quality. Check the model card for the supported truncation sizes before relying on this.
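With a self-hosted model there is no `dimensions` parameter to do the work for you, so truncation is a manual step: keep the first N components and re-normalize, because a truncated unit vector no longer has unit length. A minimal sketch, assuming the model was actually trained with MRL and needs no model-specific preprocessing (the function name and the random stand-in data are illustrative):

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of each row, then L2-normalize."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Clip to avoid dividing by zero on an all-zero row
    return truncated / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))  # stand-in for real 1024-dim model output
small = truncate_embeddings(full, 512)
print(small.shape, np.linalg.norm(small, axis=1))
```

Skipping the re-normalization is an easy mistake to make: dot products stop being cosine similarities and scores become incomparable across documents.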
When Fine-Tuning Is Worth It
Fine-tuning an embedding model produces the biggest quality gains but requires the most investment. The tradeoff curve is clear:
- Under 500 query-document pairs: fine-tuning will overfit, not worth it
- 500-2000 pairs: marginal gains, probably not worth the infrastructure cost
- 2000-10000 pairs: 10-20% recall improvement, worth it for high-value applications
- 10000+ pairs: significant gains (20-40%), clearly worth it
The training signal is contrastive learning: given a query and a relevant document (positive pair), pull their embeddings closer together, while pushing them away from random documents (negative pairs). The key implementation detail is hard negative mining - negatives that are superficially similar to the query but not actually relevant. Easy negatives (random documents) teach the model almost nothing. Hard negatives (documents that use similar words but answer a different question) are where the learning happens.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    # Optional: pass an instance as evaluator= to model.fit to track retrieval metrics
    from sentence_transformers.evaluation import InformationRetrievalEvaluator
    from torch.utils.data import DataLoader

    # Your training data: (query, positive_doc, hard_negative_doc) triples.
    # No label is needed: MultipleNegativesRankingLoss ignores it.
    train_examples = [
        InputExample(
            texts=[
                "how to handle JWT expiry",
                "The token refresh endpoint validates the existing JWT and issues a new one...",  # positive
                "JWT signing keys are stored in the secrets manager and rotated...",  # hard negative (illustrative)
            ],
        ),
        # ... more examples
    ]

    model = SentenceTransformer("BAAI/bge-en-icl")

    # MultipleNegativesRankingLoss: efficient contrastive learning.
    # Treats other positives in the batch as implicit negatives,
    # in addition to the explicit hard negative in each triple.
    train_loss = losses.MultipleNegativesRankingLoss(model)
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
        output_path="./fine-tuned-embeddings",
        show_progress_bar=True,
    )
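To make the "implicit negatives" idea concrete, here is the in-batch objective written out in NumPy. This is a sketch of the general technique, not the library's implementation: it scores every query against every document in the batch and applies cross-entropy with the matching document as the correct class. The scale of 20 is an assumption chosen for illustration.

```python
import numpy as np


def in_batch_contrastive_loss(
    query_emb: np.ndarray, doc_emb: np.ndarray, scale: float = 20.0
) -> float:
    """Cross-entropy over in-batch similarities; row i's positive is column i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                    # (batch, batch) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # diagonal = the true (query, positive) pairs


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned = in_batch_contrastive_loss(queries, queries)
shuffled = in_batch_contrastive_loss(queries, np.roll(queries, 1, axis=0))
print(f"aligned={aligned:.3f} shuffled={shuffled:.3f}")
```

The loss is low when each query sits closest to its own document and high when another document in the batch outranks it. That is also why batch composition matters: a batch of unrelated documents gives the model only easy negatives.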
For hard negative mining, a practical approach: run retrieval with your current model on all training queries, take the top-10 retrieved documents, exclude the known positives, and use the remaining as negatives. These are by definition documents the model currently confuses with the right answer.
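That mining step can be sketched in a few lines. The version below assumes you already have L2-normalized query and corpus embeddings from your current model as NumPy arrays; the function name `mine_hard_negatives` and the toy data are illustrative.

```python
from typing import Dict, List

import numpy as np


def mine_hard_negatives(
    query_emb: np.ndarray,        # (num_queries, dim), L2-normalized
    corpus_emb: np.ndarray,       # (num_docs, dim), L2-normalized
    positives: List[List[int]],   # per query: corpus indices labeled relevant
    top_k: int = 10,
) -> Dict[int, List[int]]:
    """For each query, return the top-k retrieved doc indices minus known positives."""
    scores = query_emb @ corpus_emb.T
    hard_negatives: Dict[int, List[int]] = {}
    for qi, row in enumerate(scores):
        top = np.argsort(row)[::-1][:top_k]
        known = set(positives[qi])
        hard_negatives[qi] = [int(i) for i in top if int(i) not in known]
    return hard_negatives


# Toy example: 4 one-hot "documents", one query closest to doc 0, then 1, then 2
corpus_emb = np.eye(4)
query = np.array([[1.0, 0.5, 0.2, 0.0]])
query_emb = query / np.linalg.norm(query, axis=1, keepdims=True)
mined = mine_hard_negatives(query_emb, corpus_emb, positives=[[0]], top_k=3)
print(mined)
```

One caveat: if your relevance labels are not exhaustive, some mined "negatives" will be unlabeled positives. Spot-check a sample before training on them.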
Fine-Tuning vs Switching Models
Before committing to fine-tuning, check whether a domain-specialized hosted model already covers your use case. Voyage AI has specialized models for code and legal text. Cohere’s embed-v4 performs significantly better on multilingual and non-English content than its MTEB rank suggests. A specialized hosted model at $0.10/1M tokens may outperform a fine-tuned general model with far less engineering work.
Fine-tuning becomes clearly the right choice when:
- You have internal jargon that no hosted model has ever seen
- Query patterns are highly specific (error codes, product SKUs, internal system names)
- You are already self-hosting for cost reasons
- You need to retrain regularly as the corpus evolves
The Evaluation Loop You Actually Need
Picking a model once is not enough. Embedding quality should be a tracked metric that you run regression checks against. Here is the minimum setup:
    eval set (100-300 labeled pairs)
            |
            v
    [run recall@5, MRR after each model change or fine-tuning run]
            |
            v
    [track in experiment tracking (MLflow, W&B, or a spreadsheet)]
            |
            v
    [ship model change only if recall@5 improves > 1% on eval set]
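The last box in that loop is worth encoding as an explicit check so it does not become a judgment call at review time. The sketch below reads "improves > 1%" as an absolute gain of more than 0.01 in recall@5; that reading is an assumption, as are the function name and default threshold.

```python
def should_ship(
    baseline_recall: float, candidate_recall: float, min_gain: float = 0.01
) -> bool:
    """True if the candidate beats the baseline by more than `min_gain` (absolute)."""
    return (candidate_recall - baseline_recall) > min_gain


print(should_ship(baseline_recall=0.70, candidate_recall=0.74))
print(should_ship(baseline_recall=0.70, candidate_recall=0.70))
```

On a 200-query eval set with one labeled document per query, a 0.01 change is only two queries flipping, so a larger eval set or a higher threshold gives a more trustworthy gate.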
Production-only signal is also valuable but slower to collect. Log which retrieved chunks are actually used in the final generated answer (via citation tracking or LLM-as-judge on answers). Chunks that get retrieved but never cited are your recall-precision gap in the wild.
A good production signal to add: for every query, log the maximum cosine similarity score of the retrieved results. A distribution shift in this score (e.g., average max similarity dropping from 0.82 to 0.71) is an early warning that query patterns have drifted from your corpus or embedding space.
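A minimal version of that check compares the average max-similarity score over a recent window against a reference window. The alert threshold of 0.05 and the function name are hypothetical; pick a threshold from the normal week-to-week variation in your own logs.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def max_similarity_drift(
    baseline_scores: Sequence[float],  # per-query max cosine similarity, reference window
    recent_scores: Sequence[float],    # same metric, most recent window
    threshold: float = 0.05,           # hypothetical alert threshold
) -> Dict[str, Union[float, bool]]:
    """Flag a drop in average max similarity between two windows of query logs."""
    baseline_avg = mean(baseline_scores)
    recent_avg = mean(recent_scores)
    drop = baseline_avg - recent_avg
    return {
        "baseline_avg": baseline_avg,
        "recent_avg": recent_avg,
        "drop": drop,
        "alert": drop > threshold,
    }


report = max_similarity_drift([0.82, 0.80, 0.84], [0.71, 0.70, 0.72])
print(report)
```

Comparing averages is the simplest option. If the mean hides a shift in the tail, compare a low percentile of the two windows instead.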
The Honest Assessment
What actually works and is low-risk to adopt right now:
- Run a 2-hour evaluation on 100-200 real queries from your domain before committing to a model. This almost always overturns the MTEB-based assumption.
- Use MRL truncation for cost-sensitive applications. text-embedding-3-large at 512 dimensions beats text-embedding-3-small at full dimensions in most tests, at a third of the storage cost (512 vs 1536 dimensions).
- For code search: Voyage voyage-3 or a fine-tuned BGE model. General-purpose models are measurably worse here.
- For multilingual or non-English content: Cohere embed-v4 or a multilingual BGE variant.
What requires more work than teams usually budget for:
- Fine-tuning needs hard negatives to work well. Using random negatives produces a model that barely improves over the base. Building a hard-negative mining pipeline adds 1-2 weeks to the effort.
- The eval set needs to stay updated as user query patterns shift. A set you built from logs 6 months ago may not reflect what users ask today.
Where the effort is almost never worth it:
- Chasing 1-2 point MTEB improvements between similar general-purpose models. The difference does not materialize on your data with any consistency.
- Fine-tuning with fewer than 2000 examples. The model will memorize, not generalize. Collect more data first.
The teams with good retrieval in production are not using the highest-ranked model. They are using whatever model scores best on their own labeled queries, with MRL tuned to their latency budget, and they measured this before building the rest of the system. That ordering matters. Embedding choice is infrastructure - change it later and you are re-indexing everything.