When general-purpose embeddings fall short, what drives the decision to customize, and how the training pipeline works.
Why Custom Embeddings?
When General Embeddings Fall Short
Off-the-shelf embedding models (OpenAI text-embedding-3, Cohere embed, BGE, E5) capture broad semantic similarity well. But they can fail on:
- Domain jargon: Internal abbreviations, product names, and technical terms that rarely appear in public training data.
- Near-synonym distinctions: In law, "negligence" and "gross negligence" have very different legal consequences, but general embeddings may place them close together.
- Entity-heavy corpora: When retrieval depends on matching specific entity names (drug names, part numbers, case IDs) rather than general meaning.
- Multilingual enterprise jargon: Internal terms mixed across languages in international organizations.
The Decision Framework
Custom embeddings are justified when:
- Evaluation shows repeated domain misses that chunking and reranking cannot fix.
- The value of better retrieval exceeds the cost of training, serving, and migrating the index.
- You have a trustworthy offline benchmark that reflects your real workload.
Cost of Customization
| Cost Factor | What It Involves |
|---|---|
| Data curation | Collecting query-document pairs with relevance labels |
| Training compute | GPU hours for fine-tuning (typically hours to days) |
| Re-indexing | Re-embedding the entire corpus with the new model |
| Threshold recalibration | Previous similarity thresholds no longer apply |
| Ongoing maintenance | Retraining as the domain evolves |
See Topic 2: Domain Adaptation for the specific approaches to customization.
Python Example — Building an Offline Eval Benchmark
```python
import json
from typing import Dict, List


def build_retrieval_benchmark(
    queries: List[str],
    relevant_docs: Dict[str, List[str]],
    corpus: List[str],
) -> Dict:
    """Build an offline benchmark for retrieval evaluation.

    Args:
        queries: list of real user queries
        relevant_docs: mapping query -> list of relevant doc IDs
        corpus: list of all documents
    """
    benchmark = {
        "queries": [],
        "corpus_size": len(corpus),
    }
    for query in queries:
        relevant = relevant_docs.get(query, [])
        entry = {
            "query": query,
            "relevant": relevant,
            "num_relevant": len(relevant),
        }
        benchmark["queries"].append(entry)

    # Save for reproducible evaluation
    with open("retrieval_benchmark.json", "w") as f:
        json.dump(benchmark, f, indent=2)

    # Average over the benchmark queries; guard against an empty query list
    total_relevant = sum(e["num_relevant"] for e in benchmark["queries"])
    avg_relevant = total_relevant / max(len(queries), 1)
    print(f"Benchmark: {len(queries)} queries, "
          f"{len(corpus)} docs, "
          f"avg {avg_relevant:.1f} relevant per query")
    return benchmark
```
How many labeled examples do you need for a retrieval benchmark?
Can you use reranking instead of custom embeddings?
What is the difference between fine-tuning embeddings and training from scratch?
Domain Adaptation Approaches
Driven by Errors You Can Name
Domain adaptation should be driven by retrieval errors you can name. If the system is missing exact domain distinctions, you need data and objectives that teach those distinctions explicitly. Common error patterns that drive adaptation:
- Synonym collapse: "Tylenol" and "acetaminophen" name the same drug and should embed almost identically; the model treats them as unrelated.
- False similarity: "myocardial infarction" and "myocardial inflammation" are retrieved interchangeably, but they are clinically different.
- Jargon blindness: Internal terms like "P0 escalation" or "T2-weighted MRI" have no meaning to the general model.
Adaptation Strategy by Data Availability
| Available Data | Best Approach | Expected Gain |
|---|---|---|
| Unlabeled domain text only | Continued pretraining (MLM/contrastive) | Moderate (domain vocabulary alignment) |
| 50–500 labeled pairs | Few-shot fine-tuning with synthetic negatives | Moderate to high |
| 500–10K labeled pairs | Supervised contrastive + hard negative mining | High |
| 10K+ labeled pairs | Full fine-tuning with curriculum (easy → hard negatives) | Highest |
Synthetic Data Generation
When labeled pairs are scarce, you can use an LLM to generate synthetic training data. Given a document chunk, ask a strong LLM to generate plausible queries that the chunk would answer. This "doc2query" approach can produce thousands of training pairs from an unlabeled corpus. The quality of synthetic data should always be validated against your benchmark. See Topic 3: Hard Negatives for how to pair these synthetic queries with effective negative examples.
Python Example — Generating Synthetic Training Pairs
```python
from openai import OpenAI

client = OpenAI()


def generate_training_queries(document_chunk: str, n: int = 3):
    """Generate synthetic queries for a document chunk.

    This 'doc2query' approach creates training pairs
    when labeled data is scarce.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} diverse search queries that
this document chunk would be the ideal answer for.
Make queries realistic (how a user would actually search).
Return one query per line, no numbering.
Document chunk:
{document_chunk}"""
        }],
        temperature=0.8,
    )
    queries = response.choices[0].message.content.strip().split("\n")
    # Each (query, chunk) pair becomes a positive training example
    return [{"query": q.strip(), "positive": document_chunk}
            for q in queries if q.strip()]
```
How much domain text is needed for continued pretraining?
Does domain adaptation risk forgetting general capabilities?
Can you adapt embedding models using RLHF or preference data?
Hard Negatives
Why Hard Negatives Matter
Without hard negatives, the model learns an overly easy decision boundary. It can tell that a medical query should not return cooking recipes, but it cannot distinguish which of several relevant-looking medical documents is actually the right one. This is the difference between recall (finding the right neighborhood) and precision (finding the right house).
Mining Strategies
| Strategy | How It Works | Difficulty Level |
|---|---|---|
| Random negatives | Sample random documents from the corpus | Easy (good for early training) |
| BM25 negatives | Top BM25 results that are not labeled relevant | Medium (keyword-similar but not relevant) |
| In-batch negatives | Positives of other queries in the same training batch | Easy to medium (effectively random unless batches are grouped by topic) |
| Embedding negatives | Nearest neighbors from a current embedding that are not relevant | Hard (semantically close but wrong) |
| LLM-generated | Ask an LLM to create plausible-but-wrong documents | Very hard (designed to confuse) |
Curriculum: Easy to Hard
Best practice is to train with a curriculum: start with easy negatives so the model learns basic topical separation, then gradually introduce harder negatives as training progresses. This avoids the model being overwhelmed by difficult examples before it has learned basic distinctions.
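The curriculum described above can be sketched as a sampling schedule. This is a minimal illustration, assuming negatives have already been split into an easy pool and a hard pool; the function names, the linear ramp, and the `max_hard` cap are hypothetical choices, not a prescribed recipe.

```python
import random
from typing import List, Optional


def hard_fraction(step: int, total_steps: int, max_hard: float = 0.8) -> float:
    """Linearly ramp the share of hard negatives from 0 up to max_hard."""
    if total_steps <= 1:
        return max_hard
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max_hard * progress


def sample_negatives(
    easy_pool: List[str],
    hard_pool: List[str],
    step: int,
    total_steps: int,
    n_negatives: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw negatives for one query, mixing pools by training progress."""
    rng = rng or random.Random()
    n_hard = round(hard_fraction(step, total_steps) * n_negatives)
    n_hard = min(n_hard, len(hard_pool))
    n_easy = min(n_negatives - n_hard, len(easy_pool))
    return rng.sample(hard_pool, n_hard) + rng.sample(easy_pool, n_easy)
```

Early steps draw only from the easy pool; by the final step most negatives come from the hard pool. The right ramp shape and cap are best chosen by checking retrieval metrics on a held-out benchmark.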
See Topic 4: Training Losses for the loss functions that use these negatives during training.
Python Example — Mining Hard Negatives from Embeddings
```python
import numpy as np
from sentence_transformers import SentenceTransformer


def mine_hard_negatives(
    queries, positives, corpus,
    model_name="BAAI/bge-base-en-v1.5",
    top_k=10, n_negatives=3,
):
    """Mine hard negatives using the current embedding model.

    For each query, take up to n_negatives of the top_k nearest corpus
    items that are NOT in the positive set. positives[i] is the list of
    corpus indices that are relevant to queries[i].
    """
    model = SentenceTransformer(model_name)
    # Encode everything
    q_emb = model.encode(queries, normalize_embeddings=True)
    c_emb = model.encode(corpus, normalize_embeddings=True)
    # Cosine similarities (embeddings are normalized)
    sims = q_emb @ c_emb.T  # [n_queries, n_corpus]
    triplets = []
    for i, query in enumerate(queries):
        # Indices of the top_k most similar corpus items (descending)
        ranked = np.argsort(-sims[i])[:top_k]
        pos_set = set(positives[i])
        # Hard negatives: highest-similarity non-positives
        hard_negs = [
            corpus[idx] for idx in ranked
            if int(idx) not in pos_set
        ][:n_negatives]
        for neg in hard_negs:
            triplets.append({
                "query": query,
                "positive": corpus[positives[i][0]],
                "negative": neg,
            })
    return triplets
```
How many negatives per query should you use?
Can hard negatives that are too hard hurt training?
What is cross-encoder distillation for negative mining?
Training Losses for Embedding Fine-Tuning
Loss Functions Compared
| Loss | Inputs | Key Property | Best For |
|---|---|---|---|
| Contrastive (Siamese) | Pairs + label | Fixed margin between positive/negative distances | Binary similarity (same/different) |
| Triplet | Anchor, positive, negative | Relative ordering: pos closer than neg | Fine-grained ranking with explicit negatives |
| Multiple Negatives Ranking (MNRL) | Anchor, positive (negatives from batch) | Softmax cross-entropy over batch | Large-batch training, no explicit negative mining |
| InfoNCE | Anchor, positive, N negatives | Contrastive with temperature scaling | Self-supervised and supervised contrastive learning |
| Cosine similarity | Pairs + continuous score | Direct regression on similarity score | Semantic textual similarity (STS) |
Choose by Downstream Behavior
The loss should be evaluated through downstream ranking quality, not chosen because it is fashionable. Retrieval is the target behavior, so training should be judged by retrieval metrics (recall@k, MRR, NDCG). A loss that produces great STS scores but poor retrieval results is the wrong choice for a retrieval system.
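The retrieval metrics named above can be computed per query with a few lines of code. This is a minimal sketch assuming binary relevance labels and a ranked list of document IDs; the function names are illustrative, and MRR is the mean of `reciprocal_rank` over all benchmark queries.

```python
import math
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: discounted hits divided by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Running these before and after fine-tuning on the same offline benchmark gives a like-for-like comparison of candidate losses.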
Practical Recommendations
- Start with MNRL: It is simple, uses in-batch negatives (no explicit mining needed), and works well with large batch sizes.
- Add hard negatives with triplet/InfoNCE: Once the baseline works, add mined hard negatives (see Topic 3: Hard Negatives) to push precision.
- Tune temperature: The temperature parameter in InfoNCE/MNRL controls how "sharp" the similarity distribution is. Lower temperature means stricter matching.
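To make the role of temperature concrete, here is a minimal NumPy sketch of an InfoNCE-style loss with in-batch negatives. It assumes L2-normalized embeddings where `d[i]` is the positive for `q[i]`; the function name and the default temperature are illustrative rather than taken from any particular library.

```python
import numpy as np


def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE over a batch: d[i] is the positive for q[i], and every
    other row of d serves as an in-batch negative."""
    logits = (q @ d.T) / temperature                      # [batch, batch]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true pair) as the target class
    return float(-np.mean(np.diag(log_probs)))
```

Dividing by a smaller temperature stretches the gaps between similarities, so the softmax concentrates on the closest candidates and near-misses are penalized more heavily. In sentence-transformers, the `scale` argument of `MultipleNegativesRankingLoss` plays the role of 1/temperature.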
Python Example — Fine-Tuning with Sentence Transformers
```python
from sentence_transformers import (
    SentenceTransformer, InputExample, losses
)
from torch.utils.data import DataLoader

# Load base model to fine-tune
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Prepare training data as (query, positive_doc) pairs
# MNRL uses in-batch negatives automatically
train_examples = [
    InputExample(texts=[
        "metformin side effects in elderly",
        "Common adverse effects of metformin in patients over 65..."
    ]),
    InputExample(texts=[
        "dosing guidelines for lisinopril",
        "Recommended starting dose of lisinopril is 10mg daily..."
    ]),
    # ... more (query, positive) pairs
]

# Use MNRL: in-batch negatives, no mining needed
train_dataloader = DataLoader(train_examples, batch_size=32,
                              shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune for 1 epoch
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="./domain-adapted-embedding",
)
```
What batch size should you use for MNRL?
How does the temperature parameter affect training?
Long Document Retrieval
Why One Vector Per Document Fails
A 50-page document covers many subtopics. Compressing it into a single 768- or 1536-dimensional vector necessarily loses most of the specific content. The resulting embedding captures the document's general topic but cannot match queries about specific paragraphs, figures, or data points within it.
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size token chunks | Split every N tokens with overlap | Simple, works for most text |
| Semantic chunking | Split at natural boundaries (paragraphs, sections, headings) | Structured documents |
| Recursive splitting | Try paragraph, then sentence, then token-level splits | Mixed-format documents |
| Sliding window | Overlapping windows ensure no content falls in a gap | Narrative text without clear sections |
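As an illustration of splitting at natural boundaries, the sketch below packs whole paragraphs into chunks up to a size budget. It is a simplified stand-in, not a full semantic chunker: `chunk_by_paragraph` and `max_words` are hypothetical names, blank lines are assumed to separate paragraphs, and whitespace-separated words are used as a rough proxy for tokens.

```python
from typing import List


def chunk_by_paragraph(text: str, max_words: int = 300) -> List[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own
    chunk; a recursive splitter would fall back to sentences at that point.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks never cut through a paragraph, each embedding covers complete thoughts, at the price of uneven chunk sizes.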
Chunk Size Trade-offs
- Too small (50–100 tokens): Chunks lack context. The embedding captures a sentence fragment that may be ambiguous without surrounding text.
- Too large (1000+ tokens): Chunks contain too many topics. The embedding averages across unrelated content, reducing retrieval precision.
- Sweet spot (200–512 tokens): Large enough for context, small enough for specificity. Most production systems land here.
Add overlapping windows (e.g., 50-token overlap) to ensure content at chunk boundaries is not lost. Include document metadata (title, section heading) as a prefix to each chunk to improve embedding quality.
Python Example — Fixed-Size Token Chunking with Overlap
```python
from typing import List

import tiktoken


def chunk_document(
    text: str,
    max_tokens: int = 400,
    overlap_tokens: int = 50,
    doc_title: str = "",
) -> List[str]:
    """Split a document into fixed-size overlapping token chunks with a metadata prefix."""
    # An overlap >= max_tokens would never advance and loop forever
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunk_text = enc.decode(chunk_tokens)
        # Prefix with document metadata for richer embedding
        if doc_title:
            chunk_text = f"[{doc_title}] {chunk_text}"
        chunks.append(chunk_text)
        # Stop once the end of the document is covered, so the tail is
        # not emitted again as a chunk made only of overlap
        if end >= len(tokens):
            break
        # Advance by (max_tokens - overlap) to create overlap
        start += max_tokens - overlap_tokens
    print(f"Document: {len(tokens)} tokens -> {len(chunks)} chunks")
    return chunks
```
Should you store parent document metadata with each chunk?
What is hypothetical document embedding (HyDE)?
How does late interaction (ColBERT) differ from standard chunk retrieval?
Deploying, monitoring, and evolving retrieval systems in production — multilingual considerations, compression trade-offs, and model migration.
Multilingual Embedding Systems
Key Considerations
- Language coverage: Does the embedding model support all your target languages? Some models have strong coverage for European languages but weak coverage for low-resource languages (Thai, Swahili, Tagalog).
- Script normalization: Different Unicode representations of the "same" character (e.g., full-width vs. half-width CJK) can produce different embeddings. Normalize before embedding.
- Same-language vs. cross-language: These are different tasks with different failure modes. A model may excel at same-language French retrieval but poorly align French queries with English documents.
- Evaluation per language: Build evaluation sets for each target language. A model reporting 95% recall in English may only achieve 70% recall in Korean.
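The script-normalization step can be sketched with the standard library. This is a minimal example assuming NFKC is an acceptable normalization form for your corpus; it folds full-width and half-width variants into a single form, but it also rewrites other compatibility characters (such as ligatures), so check it against your data first. The function name is illustrative.

```python
import unicodedata


def normalize_for_embedding(text: str) -> str:
    """Apply NFKC normalization and collapse whitespace before embedding."""
    text = unicodedata.normalize("NFKC", text)
    # split() with no arguments also removes leading and trailing whitespace
    return " ".join(text.split())
```

Apply the same function to both documents at indexing time and queries at search time; normalizing only one side reintroduces the mismatch.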
Multilingual Embedding Models
| Model | Languages | Strengths |
|---|---|---|
| multilingual-e5-large | 100+ | Strong cross-lingual retrieval; expects "query:" / "passage:" input prefixes |
| Cohere embed-multilingual | 100+ | Production API, good coverage |
| BGE-M3 | 100+ | Multi-functionality (dense + sparse + ColBERT-style multi-vector retrieval), long inputs |
| OpenAI text-embedding-3 | Broad | Convenient API, adjustable dimensions |
See Topic 2: Domain Adaptation for how to fine-tune multilingual models on domain-specific data in multiple languages.
Python Example — Cross-Lingual Retrieval Test
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Test cross-lingual alignment:
# Query in English, documents in multiple languages
query = "query: What are the side effects of aspirin?"
docs = [
    "passage: Aspirin side effects include stomach bleeding.",    # EN
    "passage: Nebenwirkungen von Aspirin sind Magenblutungen.",   # DE
    "passage: Les effets secondaires de l'aspirine incluent...",  # FR
    "passage: The history of bicycle manufacturing.",             # EN irrelevant
]

# Encode with instruction prefix (required for E5)
q_emb = model.encode([query], normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

# Cosine similarities
sims = (q_emb @ d_emb.T)[0]
for doc, sim in sorted(zip(docs, sims), key=lambda x: -x[1]):
    print(f"  {sim:.3f}  {doc[:60]}...")
# Expected: all three aspirin docs rank above the bicycle doc
```
How do you handle mixed-language documents?
Should you use separate indexes per language?
Compression & Quantization
Engineering Economics
Compression is best framed as engineering economics: you test whether a cheaper representation still preserves enough of the ranking signal to meet product goals. For a 1-billion-vector index of 768-dimensional embeddings, raw vector storage (excluding index overhead) is:
| Precision | Memory per 768-dim vector | Total for 1B vectors |
|---|---|---|
| FP32 | 3,072 bytes | ~3.07 TB |
| FP16 | 1,536 bytes | ~1.54 TB |
| INT8 | 768 bytes | ~768 GB |
| Binary | 96 bytes | ~96 GB |
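The per-vector figures follow from dimensions × bits per dimension ÷ 8. The helper below makes the arithmetic explicit; it counts raw vector storage only (no IDs, graph links, or other index overhead), and the names are illustrative. Totals are in decimal units (1 TB = 10^12 bytes); in binary units the same FP32 index is about 2.79 TiB and the binary index about 89 GiB.

```python
BITS_PER_DIM = {"fp32": 32, "fp16": 16, "int8": 8, "binary": 1}


def index_bytes(n_vectors: int, dim: int, precision: str) -> int:
    """Raw storage in bytes for n_vectors embeddings of the given precision."""
    return n_vectors * dim * BITS_PER_DIM[precision] // 8


for precision in BITS_PER_DIM:
    total = index_bytes(1_000_000_000, 768, precision)
    print(f"{precision:>6}: {total / 1e12:.3f} TB")
```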
Product Quantization (PQ)
Product quantization divides each vector into subvectors and quantizes each subvector independently using a learned codebook. This achieves much higher compression ratios (commonly 32x–64x relative to FP32) than simple scalar quantization, at the cost of some ranking quality that can often be recovered by re-ranking the top candidates with full-precision vectors. It is a widely used compression technique, available in FAISS and many production vector databases.
When to Compress
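- Default to FP16: For most embedding models the recall loss is negligible and memory is halved; confirm on your benchmark.
- Use INT8 for large indexes: When memory is the bottleneck and a small recall loss (on the order of 1%, measured on your benchmark) is acceptable.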
- Use binary for candidate generation: Fast approximate search to get top-1000, then re-rank with full-precision vectors.
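The binary-then-rerank pattern can be sketched in plain NumPy. This is a brute-force illustration under stated assumptions: sign-based binarization, an in-memory corpus, and hypothetical function names; a production system would use a vector database or FAISS binary index instead of scanning every code.

```python
import numpy as np


def binarize(vectors: np.ndarray) -> np.ndarray:
    """Sign-quantize to 1 bit per dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=1)


def hamming_distances(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Number of differing bits between one query code and every corpus code."""
    xor = np.bitwise_xor(codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)


def search_binary_then_rerank(
    query: np.ndarray,
    vectors: np.ndarray,
    codes: np.ndarray,
    n_candidates: int = 100,
    k: int = 10,
) -> np.ndarray:
    """Stage 1: cheap Hamming search over binary codes.
    Stage 2: exact inner-product re-ranking of the surviving candidates."""
    q_code = binarize(query[None, :])
    dists = hamming_distances(q_code, codes)
    candidates = np.argsort(dists, kind="stable")[:n_candidates]
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]
```

Only the candidate vectors need to be read at full precision, so the full-precision copy can live on slower or cheaper storage than the binary codes.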
See Topic 8: Similarity Thresholds for why thresholds must be recalibrated after changing precision.
Python Example — FAISS Index with Product Quantization
```python
import faiss
import numpy as np

dim = 768
n_vectors = 1_000_000

# Generate sample embeddings (replace with real embeddings)
vectors = np.random.randn(n_vectors, dim).astype("float32")
faiss.normalize_L2(vectors)  # normalize for cosine similarity

# Option 1: Flat index (exact, but uses full memory)
index_flat = faiss.IndexFlatIP(dim)
index_flat.add(vectors)
print(f"Flat: {index_flat.ntotal * dim * 4 / 1e9:.2f} GB")

# Option 2: IVF + PQ (compressed, approximate)
n_lists = 1024  # number of Voronoi cells
m = 48          # number of subquantizers (must divide dim)
n_bits = 8      # bits per subquantizer
quantizer = faiss.IndexFlatIP(dim)
# Use the inner-product metric so the PQ index ranks the same way as the
# flat index; IndexIVFPQ defaults to L2 when no metric is given
index_pq = faiss.IndexIVFPQ(
    quantizer, dim, n_lists, m, n_bits, faiss.METRIC_INNER_PRODUCT
)
index_pq.train(vectors)
index_pq.add(vectors)
index_pq.nprobe = 32  # search 32 of 1024 cells

# Compare recall: search both indexes for same queries
queries = np.random.randn(100, dim).astype("float32")
faiss.normalize_L2(queries)
_, exact = index_flat.search(queries, 10)
_, approx = index_pq.search(queries, 10)

# Recall@10: fraction of true top-10 found by PQ
# (random vectors have no cluster structure, so expect lower recall
# here than with real embeddings)
recall = np.mean([
    len(np.intersect1d(e, a)) / 10
    for e, a in zip(exact, approx)
])
print(f"PQ Recall@10: {recall:.1%}")
```
What is Matryoshka embedding and how does it help compression?
Can you combine quantization with dimensionality reduction?
Similarity Thresholds
Thresholds Are Pipeline Policy
Thresholds should be tuned jointly with reranking, answer generation, and abstention behavior. A threshold that maximizes offline recall may still hurt answer quality if it floods the model with weak context. The right threshold is the one that produces the best end-to-end answer quality, not the best retrieval metrics in isolation.
How to Set Thresholds
- Collect validation data: Get query-document pairs with relevance labels (relevant / not relevant).
- Compute similarity scores: Embed all queries and documents, compute cosine similarity for each pair.
- Plot precision-recall curve: Vary the threshold and measure precision and recall at each point.
- Choose based on downstream impact: A threshold that maximizes F1 may not be ideal if false positives (noise context) are much more harmful than false negatives (missing context).
What Invalidates a Threshold
| Change | Why Threshold Needs Recalibration |
|---|---|
| New embedding model | Different models produce different similarity distributions |
| Corpus changes | Adding documents shifts the similarity landscape |
| Query pattern shifts | Users asking different types of questions |
| Quantization/compression | Compression distorts distances, changing effective thresholds |
See Topic 9: Monitoring Retrieval Drift for how to detect when thresholds need adjustment, and Topic 7: Compression & Quantization for compression-induced threshold shifts.
Python Example — Finding the Optimal Threshold
```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def find_optimal_threshold(
    similarities: np.ndarray,
    labels: np.ndarray,
    beta: float = 1.0,
) -> dict:
    """Find the threshold that maximizes F-beta score.

    Args:
        similarities: cosine similarities for each (query, doc) pair
        labels: 1 = relevant, 0 = not relevant
        beta: F-beta weight (beta > 1 favors recall,
            beta < 1 favors precision)
    """
    precisions, recalls, thresholds = precision_recall_curve(
        labels, similarities
    )
    # precision_recall_curve returns one more precision/recall value than
    # thresholds (a final point with recall 0); drop it so that every
    # index maps to a real threshold
    precisions, recalls = precisions[:-1], recalls[:-1]
    # Compute F-beta at each threshold
    f_scores = (
        (1 + beta**2) * precisions * recalls /
        (beta**2 * precisions + recalls + 1e-8)
    )
    best_idx = np.argmax(f_scores)
    return {
        "threshold": float(thresholds[best_idx]),
        "precision": float(precisions[best_idx]),
        "recall": float(recalls[best_idx]),
        "f_score": float(f_scores[best_idx]),
    }
```
Should you use a fixed threshold or a dynamic one?
What happens when retrieval returns zero results above the threshold?
Monitoring Retrieval Drift
Sources of Drift
Drift can come from changes in user language, data ingestion, new product terms, or updated business processes. The embedding model remains frozen, but the world it represents keeps changing:
- User language shifts: New jargon, trending terms, or changes in how users phrase queries.
- Corpus evolution: New documents added, old documents removed or updated. The embedding index may not reflect current reality.
- Business process changes: A reorganization or product rename means old queries no longer map to the right documents.
- Seasonal patterns: Query patterns shift with business cycles (tax season, product launches, regulatory deadlines).
Building a Monitoring Pipeline
| Component | What to Track | Alert When |
|---|---|---|
| Canary queries | Recall on a fixed set of known-good pairs | Recall drops below baseline |
| Score distribution | Mean/median/p95 of top-k similarity scores | Distribution shifts significantly |
| User feedback | Thumbs up/down, click-through, answer acceptance | Satisfaction trends downward |
| LLM abstention rate | How often the LLM says "I don't have enough info" | Rate increases (may indicate retrieval gaps) |
| Empty result rate | Queries returning zero results above threshold | Rate increases |
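The score-distribution row of the table can be sketched as a comparison between a baseline window and a recent window of top-k similarity scores. The summary statistics mirror the table; the function names and the fixed `tolerance` are illustrative assumptions, and a statistical test could replace the simple delta check.

```python
from typing import Dict, Sequence

import numpy as np


def score_summary(top_scores: Sequence[float]) -> Dict[str, float]:
    """Mean, median, and p95 of logged top-k similarity scores."""
    s = np.asarray(top_scores, dtype=float)
    return {
        "mean": float(s.mean()),
        "median": float(np.median(s)),
        "p95": float(np.percentile(s, 95)),
    }


def detect_score_shift(
    baseline: Sequence[float],
    current: Sequence[float],
    tolerance: float = 0.05,
) -> Dict:
    """Flag any summary statistic that moved by more than `tolerance`."""
    base, cur = score_summary(baseline), score_summary(current)
    deltas = {name: cur[name] - base[name] for name in base}
    shifted = [name for name, delta in deltas.items() if abs(delta) > tolerance]
    return {"deltas": deltas, "shifted": shifted, "alert": bool(shifted)}
```

A sustained drop in these statistics with no change to the model or index suggests that queries are moving away from what the corpus covers.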
See Topic 8: Similarity Thresholds for when to recalibrate thresholds as drift is detected.
Python Example — Canary Set Monitoring
```python
import time
from typing import Dict, List


def run_canary_check(
    search_fn,
    canary_set: List[Dict],
    k: int = 10,
) -> Dict:
    """Check retrieval quality against known-good pairs.

    Args:
        search_fn: function(query, top_k) -> list of results,
            each a dict with an "id" key
        canary_set: [{"query": str, "expected_doc_id": str}, ...]
        k: number of results to check
    """
    if not canary_set:
        raise ValueError("canary_set must not be empty")
    hits = 0
    scores = []
    failures = []
    for canary in canary_set:
        results = search_fn(canary["query"], top_k=k)
        result_ids = [r["id"] for r in results]
        if canary["expected_doc_id"] in result_ids:
            hits += 1
            rank = result_ids.index(canary["expected_doc_id"]) + 1
            scores.append(1.0 / rank)  # reciprocal rank
        else:
            failures.append(canary["query"])
            scores.append(0.0)
    result = {
        "timestamp": time.time(),
        "recall_at_k": hits / len(canary_set),
        "mrr": sum(scores) / len(scores),
        "failures": failures,
    }
    # Alert if recall drops below threshold
    if result["recall_at_k"] < 0.9:
        print(f"ALERT: Canary recall dropped to "
              f"{result['recall_at_k']:.1%}")
    return result
```
How often should you run canary checks?
How do you distinguish drift from a bug?
When should drift trigger re-training vs. re-indexing?
Embedding Model Migration
Not Just Swapping an API Call
Changing the embedding model is not just swapping one API call. It is a data migration and quality management exercise. Vectors from different models live in incompatible spaces — you cannot search a new-model query against an old-model index. Every document must be re-embedded.
Migration Checklist
| Step | Details | Common Pitfall |
|---|---|---|
| Baseline benchmark | Measure current recall, MRR, NDCG on eval set | Migrating without a baseline to compare against |
| Corpus re-embedding | Embed all documents; budget compute and time | Underestimating re-embedding time for large corpora |
| Threshold recalibration | Find new optimal thresholds on validation data | Reusing old thresholds (different score distributions) |
| Reranker compatibility | Re-evaluate the existing reranker on candidates from the new index | Assuming reranker quality is unchanged when the candidate set shifts |
| A/B testing | Route 5–10% of traffic to new index, measure user impact | Big-bang cutover without shadow comparison |
| Rollback plan | Keep old index available for quick revert | Deleting old index before verifying new one |
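For the A/B testing step, one common way to send a fixed share of traffic to the new index is deterministic hash bucketing, so a given user always sees the same index. The sketch below is one possible approach; `route_to_new_index`, the use of a user ID as the key, and the 10,000-bucket granularity are all illustrative assumptions.

```python
import hashlib


def route_to_new_index(user_id: str, rollout_percent: float) -> bool:
    """Return True if this user should be served by the new index.

    The user ID is hashed into one of 10,000 buckets; users whose bucket
    falls below the rollout share go to the new index.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rollout_percent * 100
```

Because the bucket depends only on the user ID, raising the rollout from 5% to 10% keeps every previously routed user on the new index and only adds new ones.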
Dual-Run Strategy
The safest migration pattern is to dual-run both indexes during transition:
- Query both indexes for every request.
- Serve results from the old index (no user impact).
- Log and compare results from both indexes offline.
- When the new index consistently matches or exceeds the old one, switch traffic gradually.
- Keep the old index available for rollback for 1–2 weeks after full cutover.
See Topic 8: Similarity Thresholds and Topic 9: Monitoring Retrieval Drift for the calibration and monitoring that must accompany migration.
Python Example — Dual-Index Comparison
```python
from typing import Callable, Dict


def compare_indexes(
    query: str,
    old_search: Callable,
    new_search: Callable,
    k: int = 10,
) -> Dict:
    """Compare retrieval results from old and new embedding indexes.

    Returns overlap metrics and both result sets for analysis.
    """
    old_results = old_search(query, top_k=k)
    new_results = new_search(query, top_k=k)
    old_ids = [r["id"] for r in old_results]
    new_ids = [r["id"] for r in new_results]
    # Measure overlap between result sets
    overlap = len(set(old_ids) & set(new_ids))
    # Jaccard similarity of the two top-k sets (ignores rank order);
    # two empty result sets count as full agreement
    union = len(set(old_ids) | set(new_ids))
    jaccard = overlap / union if union else 1.0
    # Top-1 agreement
    top1_match = old_ids[0] == new_ids[0] if old_ids and new_ids else False
    return {
        "query": query,
        "overlap_at_k": overlap / k,
        "jaccard": jaccard,
        "top1_match": top1_match,
        "old_top3": old_ids[:3],
        "new_top3": new_ids[:3],
    }

# Run over benchmark queries and log for analysis
# Look for patterns in disagreements to understand
# where the new model differs from the old one
```