Ch 12: Custom Embeddings & Retrieval Optimization

Foundations

When general-purpose embeddings fall short, what drives the decision to customize, and how the training pipeline works.

Why Custom Embeddings?

A team chooses custom embeddings when domain-specific distinctions matter more than general semantic similarity. In medicine, finance, law, or internal enterprise knowledge, a generic embedding model may collapse distinctions that your application cannot afford to lose.

💡 General embeddings see "prescription" and "over-the-counter" as similar (both are medicines). A custom medical embedding knows that confusing them could be dangerous.

Optimization Ladder — exhaust cheaper levers before training custom embeddings

Data Hygiene

Better chunking, document cleanup, metadata enrichment

Cheap, high-leverage, and easy to validate.

Ranking

Add a reranker or metadata-aware filters

Often improves precision without retraining embeddings.

Query Strategy

Reformulate queries, add hybrid search (keyword + vector)

Helps recall on short or ambiguous queries.

Custom Training

Domain-adapt the embedding model

Best once earlier levers are exhausted and benchmarks prove the gap.
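
As an illustration of the Query Strategy rung, one common way to combine keyword and vector results is reciprocal rank fusion (RRF). RRF is not prescribed by the ladder above; it is one option among several. The function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` (a conventional default) are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so the order is stable
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a keyword (BM25) search and a vector search
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear high in both lists rise to the top, which is why fusion tends to help on short queries where either retriever alone is unreliable.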

When General Embeddings Fall Short

Off-the-shelf embedding models (OpenAI text-embedding-3, Cohere embed, BGE, E5) capture broad semantic similarity well. But they can fail on:

Domain jargon: Internal abbreviations, product names, and technical terms that rarely appear in public training data.
Near-synonym distinctions: In law, "negligence" and "gross negligence" have very different legal consequences, but general embeddings may place them close together.
Entity-heavy corpora: When retrieval depends on matching specific entity names (drug names, part numbers, case IDs) rather than general meaning.
Multilingual enterprise jargon: Internal terms mixed across languages in international organizations.

The Decision Framework

Custom embeddings are justified when:

Evaluation shows repeated domain misses that chunking and reranking cannot fix.
The value of better retrieval exceeds the cost of training, serving, and migrating the index.
You have a trustworthy offline benchmark that reflects your real workload.

Red flag: Training a custom embedding model before building a trustworthy offline benchmark. You cannot measure improvement without a baseline.

Cost of Customization

Cost Factor	What It Involves
Data curation	Collecting query-document pairs with relevance labels
Training compute	GPU hours for fine-tuning (typically hours to days)
Re-indexing	Re-embedding the entire corpus with the new model
Threshold recalibration	Previous similarity thresholds no longer apply
Ongoing maintenance	Retraining as the domain evolves

See Topic 2: Domain Adaptation for the specific approaches to customization.

→ Custom embeddings are worth the effort only after cheaper relevance levers are measured and exhausted against a benchmark that reflects the real workload.

Python Example — Building an Offline Eval Benchmark

import json
from typing import List, Dict

def build_retrieval_benchmark(
    queries: List[str],
    relevant_docs: Dict[str, List[str]],
    corpus: List[str]
) -> Dict:
    """Build an offline benchmark for retrieval evaluation.

    Args:
        queries: list of real user queries
        relevant_docs: mapping query -> list of relevant doc IDs
        corpus: list of all documents
    """
    benchmark = {
        "queries": [],
        "corpus_size": len(corpus),
    }

    for query in queries:
        entry = {
            "query": query,
            "relevant": relevant_docs.get(query, []),
            "num_relevant": len(relevant_docs.get(query, [])),
        }
        benchmark["queries"].append(entry)

    # Save for reproducible evaluation
    with open("retrieval_benchmark.json", "w") as f:
        json.dump(benchmark, f, indent=2)

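    # Average over the benchmark's own entries and guard against an empty
    # query list (the original divided by len(queries) unconditionally)
    total_relevant = sum(e["num_relevant"] for e in benchmark["queries"])
    avg_relevant = total_relevant / max(len(queries), 1)
    print(f"Benchmark: {len(queries)} queries, "
          f"{len(corpus)} docs, "
          f"avg {avg_relevant:.1f} relevant per query")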
    return benchmark

Follow-up Questions

How many labeled examples do you need for a retrieval benchmark?

A useful benchmark requires at least 50–100 queries with relevance judgments. More is better, but even 50 well-chosen queries covering different query types and failure modes can reveal systematic gaps. The key is that the queries must reflect real user behavior, not synthetic patterns.

Can you use reranking instead of custom embeddings?

Often, yes. A cross-encoder reranker (e.g., Cohere Rerank, BGE-reranker) re-scores the top-k results from a general embedding model and frequently recovers most of the quality gap. Reranking is cheaper to deploy than re-embedding the entire corpus and should be tested before committing to custom training. See the optimization ladder above.

What is the difference between fine-tuning embeddings and training from scratch?

Fine-tuning starts from a pre-trained embedding model and adapts it with domain data, preserving general capabilities while adding domain specificity. Training from scratch builds a new model entirely from domain data, which requires far more data and compute. Fine-tuning is almost always the right choice unless your domain is radically different from any public text.

Domain Adaptation Approaches

Common approaches include continued pretraining on domain text, supervised contrastive training on labeled query-document pairs, hard-negative mining, and task-specific fine-tuning. The right approach depends on the amount and quality of supervision available.

💡 Domain adaptation is like teaching a translator who already knows the language but not the industry jargon. You give them domain documents and examples of correct usage until they learn the distinctions that matter.

Continued Pretraining

Continue the masked/contrastive pretraining objective on unlabeled domain text.

When: lots of domain text, no labeled pairs

Supervised Contrastive

Train on labeled (query, positive, negative) triplets with contrastive loss.

When: labeled query-document pairs available

Hard Negative Mining

Select negatives that are deceptively similar to force fine-grained learning.

When: basic model works but lacks precision

Task-Specific Fine-Tuning

Fine-tune for the exact retrieval or similarity objective of your pipeline.

When: clear downstream task (Q&A, search, dedup)

Driven by Errors You Can Name

Domain adaptation should be driven by retrieval errors you can name. If the system is missing exact domain distinctions, you need data and objectives that teach those distinctions explicitly. Common error patterns that drive adaptation:

Synonym miss: "Tylenol" and "acetaminophen" name the same drug and should embed close together, but the model places them far apart.
False similarity: "myocardial infarction" and "myocardial inflammation" are retrieved interchangeably, but they are clinically different.
Jargon blindness: Internal terms like "P0 escalation" or "T2-weighted MRI" have no meaning to the general model.

Adaptation Strategy by Data Availability

Available Data	Best Approach	Expected Gain
Unlabeled domain text only	Continued pretraining (MLM/contrastive)	Moderate (domain vocabulary alignment)
50–500 labeled pairs	Few-shot fine-tuning with synthetic negatives	Moderate to high
500–10K labeled pairs	Supervised contrastive + hard negative mining	High
10K+ labeled pairs	Full fine-tuning with curriculum (easy → hard negatives)	Highest

Synthetic Data Generation

When labeled pairs are scarce, you can use an LLM to generate synthetic training data. Given a document chunk, ask a strong LLM to generate plausible queries that the chunk would answer. This "doc2query" approach can produce thousands of training pairs from an unlabeled corpus. The quality of synthetic data should always be validated against your benchmark. See Topic 3: Hard Negatives for how to pair these synthetic queries with effective negative examples.

→ Domain adaptation should be driven by retrieval errors you can name — if the system is missing exact domain distinctions, you need data and objectives that teach those distinctions explicitly.

Python Example — Generating Synthetic Training Pairs

from openai import OpenAI

client = OpenAI()

def generate_training_queries(document_chunk: str, n: int = 3):
    """Generate synthetic queries for a document chunk.

    This 'doc2query' approach creates training pairs
    when labeled data is scarce.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate {n} diverse search queries that
this document chunk would be the ideal answer for.
Make queries realistic (how a user would actually search).
Return one query per line, no numbering.

Document chunk:
{document_chunk}"""
        }],
        temperature=0.8
    )

    queries = response.choices[0].message.content.strip().split("\n")
    # Each (query, chunk) pair becomes a positive training example
    return [{"query": q.strip(), "positive": document_chunk}
            for q in queries if q.strip()]

Follow-up Questions

How much domain text is needed for continued pretraining?

The amount depends on domain specificity. For highly specialized domains (legal, medical), 1–10 million tokens of domain text can show measurable improvement. For moderately specialized domains, you may need 50–100 million tokens. The key metric is whether domain-specific vocabulary and concepts appear frequently enough for the model to learn their relationships.

Does domain adaptation risk forgetting general capabilities?

Yes, catastrophic forgetting is a real risk. Aggressive fine-tuning on domain data can degrade performance on general queries. Mitigations include using a lower learning rate, mixing domain data with general data during training, and evaluating on both domain and general benchmarks. Some teams keep both the original and adapted models and route queries based on domain detection.

Can you adapt embedding models using RLHF or preference data?

Yes, preference-based training for embeddings is an emerging approach. Instead of explicit relevance labels, you use pairwise preferences ("Document A is more relevant than Document B for this query"). This is easier to collect from user click data or expert feedback and can produce models that better match real-world relevance judgments.

Hard Negatives

Hard negatives are non-relevant items that look deceptively similar to the query. They force the model to learn fine-grained distinctions instead of relying on superficial cues. Easy negatives teach separation; hard negatives teach precision.

💡 Easy negatives are like studying for a test with obviously wrong answers. Hard negatives are like studying with trick questions — they force you to really understand the material.

Query: "side effects of metformin in elderly patients"

Easy Negative

"The history of bicycle manufacturing in the 20th century."

Teaches: basic topic separation (medicine vs. manufacturing)

Hard Negative

"Metformin dosing guidelines for type 2 diabetes management in adult patients."

Teaches: distinguishing "side effects + elderly" from "dosing + adults" despite shared drug name

Why Hard Negatives Matter

Without hard negatives, the model learns an overly easy decision boundary. It can tell that a medical query should not return cooking recipes, but it cannot distinguish which of several relevant-looking medical documents is actually the right one. This is the difference between recall (finding the right neighborhood) and precision (finding the right house).

Mining Strategies

Strategy	How It Works	Difficulty Level
Random negatives	Sample random documents from the corpus	Easy (good for early training)
BM25 negatives	Top BM25 results that are not labeled relevant	Medium (keyword-similar but not relevant)
In-batch negatives	Other positives in the same training batch	Easy to medium (mostly random unless batches are grouped by topic)
Embedding negatives	Nearest neighbors from a current embedding that are not relevant	Hard (semantically close but wrong)
LLM-generated	Ask an LLM to create plausible-but-wrong documents	Very hard (designed to confuse)

Curriculum: Easy to Hard

Best practice is to train with a curriculum: start with easy negatives so the model learns basic topical separation, then gradually introduce harder negatives as training progresses. This avoids the model being overwhelmed by difficult examples before it has learned basic distinctions.
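
A minimal sketch of such a curriculum, assuming you already have separate pools of easy and hard negatives per query. The linear ramp, the 80% cap on hard negatives, and the function names `hard_fraction` and `sample_negatives` are illustrative choices, not a recommended schedule.

```python
import random
from typing import List


def hard_fraction(epoch: int, total_epochs: int, max_hard: float = 0.8) -> float:
    """Linearly ramp the share of hard negatives from 0 to max_hard."""
    if total_epochs <= 1:
        return max_hard
    return max_hard * epoch / (total_epochs - 1)


def sample_negatives(
    easy: List[str],
    hard: List[str],
    n: int,
    epoch: int,
    total_epochs: int,
    rng: random.Random,
) -> List[str]:
    """Pick n negatives for one query, mixing easy and hard by epoch."""
    n_hard = min(round(n * hard_fraction(epoch, total_epochs)), len(hard))
    n_easy = min(n - n_hard, len(easy))
    return rng.sample(hard, n_hard) + rng.sample(easy, n_easy)


rng = random.Random(0)
easy_pool = [f"easy_{i}" for i in range(10)]
hard_pool = [f"hard_{i}" for i in range(10)]
for epoch in range(5):
    print(epoch, sample_negatives(easy_pool, hard_pool, 5, epoch, 5, rng))
```

Early epochs draw only from the easy pool; by the last epoch most negatives are hard, with a few easy ones kept to retain basic topical separation.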

See Topic 4: Training Losses for the loss functions that use these negatives during training.

→ Easy negatives teach the model to separate topics; hard negatives teach it to distinguish within topics — the latter drives the precision gains that matter in production retrieval.

Python Example — Mining Hard Negatives from Embeddings

import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(
    queries, positives, corpus,
    model_name="BAAI/bge-base-en-v1.5",
    top_k=10, n_negatives=3
):
    """Mine hard negatives using current embedding model.

    For each query, find the top-k nearest corpus items
    that are NOT in the positive set.
    """
    model = SentenceTransformer(model_name)

    # Encode everything
    q_emb = model.encode(queries, normalize_embeddings=True)
    c_emb = model.encode(corpus, normalize_embeddings=True)

    # Compute cosine similarities
    sims = q_emb @ c_emb.T  # [n_queries, n_corpus]

    triplets = []
    for i, query in enumerate(queries):
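        # Candidate pool: the top_k most similar corpus items (descending).
        # The original ignored top_k and scanned the whole corpus.
        ranked = np.argsort(-sims[i])[:top_k]
        # positives[i] holds corpus indices of the relevant documents
        pos_set = set(positives[i])
        # Hard negatives: highest-similarity candidates that are not positives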
        hard_negs = [
            corpus[idx] for idx in ranked
            if idx not in pos_set
        ][:n_negatives]

        for neg in hard_negs:
            triplets.append({
                "query": query,
                "positive": corpus[positives[i][0]],
                "negative": neg,
            })
    return triplets

Follow-up Questions

How many negatives per query should you use?

Typical practice is 3–7 hard negatives per query. More negatives per query generally help, but with diminishing returns beyond ~10. Some loss functions (like multiple negatives ranking loss) use all other in-batch samples as implicit negatives, which can provide hundreds of negatives per query without explicit mining.

Can hard negatives that are too hard hurt training?

Yes. If negatives are so close to the query that they are arguably relevant (false negatives), the model receives conflicting gradients. This is why human verification of the hardest negatives matters, and why curriculum training (easy to hard) is more stable than starting with only the hardest examples.

What is cross-encoder distillation for negative mining?

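A cross-encoder (which processes query and document together) gives more accurate relevance scores than a bi-encoder (which embeds them separately). You can use a cross-encoder to score candidate negatives and keep only those that it confirms are truly non-relevant, which reduces false-negative noise in your training data. Strictly speaking, that step is cross-encoder filtering; distillation goes one step further and trains the bi-encoder to reproduce the cross-encoder's scores, so the cheaper model inherits some of its judgment.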
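
The filtering step can be sketched as follows. The scorer is passed in as a plain callable so the example stays self-contained; `toy_score` is a word-overlap stand-in, not a real cross-encoder, and the `max_neg_score=0.5` cutoff is an arbitrary assumption you would tune on labeled data.

```python
from typing import Callable, Dict, List


def filter_false_negatives(
    triplets: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
    max_neg_score: float = 0.5,
) -> List[Dict[str, str]]:
    """Drop triplets whose 'negative' scores as likely relevant."""
    return [
        t for t in triplets
        if score_fn(t["query"], t["negative"]) < max_neg_score
    ]


def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer (word overlap); swap in a real cross-encoder."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


triplets = [
    {"query": "metformin side effects elderly",
     "positive": "adverse effects of metformin in older adults",
     "negative": "metformin side effects elderly patients"},
    {"query": "metformin side effects elderly",
     "positive": "adverse effects of metformin in older adults",
     "negative": "bicycle manufacturing history"},
]
kept = filter_false_negatives(triplets, toy_score)
print(len(kept), "of", len(triplets), "triplets kept")
```

The first negative is so close to the query that it is probably relevant, so it is dropped rather than used to send the model a conflicting signal.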

Training Losses for Embedding Fine-Tuning

Contrastive, triplet, and multiple-negatives ranking losses are common because they directly optimize the geometry of relevant and non-relevant pairs in embedding space. The exact loss matters less than whether it aligns with the retrieval behavior you want.

💡 The training loss is like the scoring rubric for an exam. Different rubrics emphasize different skills. Choose the rubric that tests what matters for your use case.

Common embedding training losses and their geometric effects

Contrastive: Pull positives together, push negatives apart by a margin.
Triplet: Anchor closer to positive than negative by a margin.
MNRL: Softmax over in-batch similarities; positive should rank #1.

Loss Functions Compared

Loss	Inputs	Key Property	Best For
Contrastive (Siamese)	Pairs + label	Fixed margin between positive/negative distances	Binary similarity (same/different)
Triplet	Anchor, positive, negative	Relative ordering: pos closer than neg	Fine-grained ranking with explicit negatives
Multiple Negatives Ranking (MNRL)	Anchor, positive (negatives from batch)	Softmax cross-entropy over batch	Large-batch training, no explicit negative mining
InfoNCE	Anchor, positive, N negatives	Contrastive with temperature scaling	Self-supervised and supervised contrastive learning
Cosine similarity	Pairs + continuous score	Direct regression on similarity score	Semantic textual similarity (STS)
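
To make two rows of the table concrete, here is a dependency-free sketch of the triplet loss and the MNRL objective on toy vectors. It uses cosine similarity with a margin of 0.2 and a temperature of 0.05; those values and the function names are assumptions for illustration, and real training would use a tensor library with gradients.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor: Vector, pos: Vector, neg: Vector,
                 margin: float = 0.2) -> float:
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, cosine(anchor, neg) - cosine(anchor, pos) + margin)


def mnrl_loss(anchors: List[Vector], positives: List[Vector],
              temperature: float = 0.05) -> float:
    """Softmax cross-entropy where row i's correct class is positives[i];
    every other positive in the batch acts as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnrl_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned batch: {aligned:.4f}  swapped batch: {swapped:.4f}")
```

The aligned batch yields a loss near zero because each anchor ranks its own positive first, while the swapped batch is penalized heavily.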

Choose by Downstream Behavior

The loss should be evaluated through downstream ranking quality, not chosen because it is fashionable. Retrieval is the target behavior, so training should be judged by retrieval metrics (recall@k, MRR, NDCG). A loss that produces great STS scores but poor retrieval results is the wrong choice for a retrieval system.
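
A minimal sketch of two of these metrics, assuming each query has a ranked list of retrieved doc IDs (`runs`) and a list of relevant doc IDs (`qrels`). Both names and the data layout are assumptions, not a fixed format.

```python
from typing import Dict, List


def recall_at_k(ranked: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: List[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, List[str]],
             k: int = 10) -> Dict[str, float]:
    """Average recall@k and MRR over every query in qrels."""
    n = max(len(qrels), 1)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items())
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in qrels.items())
    return {f"recall@{k}": recall / n, "mrr": mrr / n}


runs = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d8"]}
qrels = {"q1": ["d1", "d2"], "q2": ["d5"]}
print(evaluate(runs, qrels, k=2))
```

Running the same function on the base model and on each fine-tuned checkpoint gives a like-for-like comparison that is independent of which loss produced the checkpoint.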

Practical Recommendations

Start with MNRL: It is simple, uses in-batch negatives (no explicit mining needed), and works well with large batch sizes.
Add hard negatives with triplet/InfoNCE: Once the baseline works, add mined hard negatives (see Topic 3: Hard Negatives) to push precision.
Tune temperature: The temperature parameter in InfoNCE/MNRL controls how "sharp" the similarity distribution is. Lower temperature means stricter matching.

→ The training loss shapes the geometry of your embedding space — choose it based on downstream retrieval metrics, not theoretical elegance.

Python Example — Fine-Tuning with Sentence Transformers

from sentence_transformers import (
    SentenceTransformer, InputExample, losses
)
from torch.utils.data import DataLoader

# Load base model to fine-tune
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Prepare training data as (query, positive_doc) pairs
# MNRL uses in-batch negatives automatically
train_examples = [
    InputExample(texts=[
        "metformin side effects in elderly",
        "Common adverse effects of metformin in patients over 65..."
    ]),
    InputExample(texts=[
        "dosing guidelines for lisinopril",
        "Recommended starting dose of lisinopril is 10mg daily..."
    ]),
    # ... more (query, positive) pairs
]

# Use MNRL: in-batch negatives, no mining needed
train_dataloader = DataLoader(train_examples, batch_size=32,
                              shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

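# Fine-tune for 1 epoch (model.fit is the older training API; recent
# Sentence Transformers releases also provide SentenceTransformerTrainer)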
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="./domain-adapted-embedding",
)

Follow-up Questions

What batch size should you use for MNRL?

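Larger batches provide more in-batch negatives, which improves the training signal. Aim for 32–256 per GPU. Plain gradient accumulation does not help here: each forward pass still sees only the negatives in its own mini-batch, so accumulating gradients raises the effective batch size without adding negatives. If GPU memory is limited, use a gradient-caching variant such as CachedMultipleNegativesRankingLoss in Sentence Transformers, which trades extra compute for a larger pool of in-batch negatives. Batch sizes below 16 usually provide too few negatives for MNRL to work well.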

How does the temperature parameter affect training?

Temperature scales the logits before softmax. Lower temperature (e.g., 0.05) creates a sharper distribution, penalizing hard negatives more aggressively. Higher temperature (e.g., 0.1) creates a softer distribution, which is more forgiving of noisy labels. Start with the default (usually 0.05–0.07) and tune based on validation retrieval metrics.
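
The effect is easy to see on a toy example. The similarity values below are made up: one positive, one hard negative, and one easy negative.

```python
import math
from typing import List


def softmax_with_temperature(similarities: List[float],
                             temperature: float) -> List[float]:
    """Divide similarities by the temperature, then apply softmax."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical cosine similarities: positive, hard negative, easy negative
sims = [0.80, 0.75, 0.30]
for t in (0.05, 0.1, 1.0):
    probs = softmax_with_temperature(sims, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature the small gap between the positive and the hard negative is amplified, so the loss concentrates on that pair; at high temperature the three candidates look almost interchangeable.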

Long Document Retrieval

Long documents are usually split into chunks because compressing an entire document into one vector often loses too much detail. Embed chunks for retrieval, then reconstruct document-level understanding from the relevant pieces.

💡 A single embedding for a book is like a one-sentence summary — it tells you the topic but not which chapter answers your question. Chunk-level embeddings are like an index.

The retrieval hierarchy: chunk for search, aggregate for understanding

Full Document: Too long for a single embedding. Key details are averaged away.
↓ split
Chunks (200–512 tokens each): Each chunk gets its own embedding. Preserves granularity for retrieval.
↓ retrieve top-k
Retrieved Chunks: Top-k most relevant chunks returned to the LLM for reasoning.
↓ aggregate
Document-Level Reasoning: LLM synthesizes across retrieved chunks, optionally re-ranking or expanding.

Why One Vector Per Document Fails

A 50-page document covers many subtopics. Compressing it into a single 768- or 1536-dimensional vector necessarily loses most of the specific content. The resulting embedding captures the document's general topic but cannot match queries about specific paragraphs, figures, or data points within it.

Chunking Strategies

Strategy	How It Works	Best For
Fixed-size token chunks	Split every N tokens with overlap	Simple, works for most text
Semantic chunking	Split at natural boundaries (paragraphs, sections, headings)	Structured documents
Recursive splitting	Try paragraph, then sentence, then token-level splits	Mixed-format documents
Sliding window	Overlapping windows ensure no content falls in a gap	Narrative text without clear sections
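
Recursive splitting is the least obvious row in this table, so here is a small sketch of it. It measures length in whitespace-separated words as a rough proxy for tokens, and the separator list and the name `recursive_split` are assumptions; a production splitter would count real tokens and preserve paragraph breaks when merging.

```python
import re
from typing import List

# Coarsest to finest: paragraph break, sentence end, any whitespace
SEPARATORS = ["\n\n", r"(?<=[.!?])\s+", r"\s+"]


def recursive_split(text: str, max_words: int = 100, level: int = 0) -> List[str]:
    """Split text into pieces of at most max_words words.

    Tries the coarsest separator first and falls back to a finer one
    only for pieces that are still too long.
    """
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words or level >= len(SEPARATORS):
        return [text]
    pieces: List[str] = []
    for part in re.split(SEPARATORS[level], text):
        pieces.extend(recursive_split(part, max_words, level + 1))
    # Greedily merge neighbors back together while they still fit
    merged: List[str] = []
    for piece in pieces:
        if merged and len(merged[-1].split()) + len(piece.split()) <= max_words:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


sample = "A b c. D e f.\n\nG h i j k l m."
chunks = recursive_split(sample, max_words=4)
print(chunks)
```

Short sentences stay intact, and only the over-long final sentence is broken at word level.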

Chunk Size Trade-offs

Too small (50–100 tokens): Chunks lack context. The embedding captures a sentence fragment that may be ambiguous without surrounding text.
Too large (1000+ tokens): Chunks contain too many topics. The embedding averages across unrelated content, reducing retrieval precision.
Sweet spot (200–512 tokens): Large enough for context, small enough for specificity. Most production systems land here.

Add overlapping windows (e.g., 50-token overlap) to ensure content at chunk boundaries is not lost. Include document metadata (title, section heading) as a prefix to each chunk to improve embedding quality.

→ Embed chunks for retrieval, then reconstruct document-level understanding from the relevant pieces — this hierarchy is usually more effective than one-vector-per-document strategies.

Python Example — Fixed-Size Token Chunking with Overlap

from typing import List
import tiktoken

def chunk_document(
    text: str,
    max_tokens: int = 400,
    overlap_tokens: int = 50,
    doc_title: str = "",
) -> List[str]:
    """Split a document into overlapping chunks with metadata prefix."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)

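    # Without this guard the loop below would never advance
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunk_text = enc.decode(chunk_tokens)

        # Prefix with document metadata for richer embedding
        if doc_title:
            chunk_text = f"[{doc_title}] {chunk_text}"

        chunks.append(chunk_text)

        # Stop once the final token is covered; otherwise the loop would
        # emit a trailing chunk made up entirely of overlap
        if end >= len(tokens):
            break

        # Advance by (max_tokens - overlap) to create overlap
        start += max_tokens - overlap_tokens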

    print(f"Document: {len(tokens)} tokens -> {len(chunks)} chunks")
    return chunks

Follow-up Questions

Should you store parent document metadata with each chunk?

Yes. Store the document ID, title, section heading, and chunk position as metadata alongside each chunk embedding. This enables post-retrieval grouping (showing all chunks from the same document together) and contextual re-ranking. Many vector databases support metadata filtering, which lets you combine semantic search with structured filters.

What is hypothetical document embedding (HyDE)?

HyDE generates a hypothetical answer to the query using an LLM, then embeds that answer and searches for similar real documents. The intuition is that the hypothetical answer is closer in embedding space to the actual relevant documents than the short query is. It improves recall on short or ambiguous queries but adds latency from the LLM generation step.

How does late interaction (ColBERT) differ from standard chunk retrieval?

ColBERT stores per-token embeddings rather than a single vector per chunk. At query time, it takes each query token, finds its maximum similarity over all document tokens, and sums those maxima into the document score (the MaxSim operator). This preserves more granular matching information and often outperforms single-vector retrieval, but requires significantly more storage and compute.
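
The late-interaction score can be sketched in a few lines. The two-dimensional token vectors are toy values chosen for illustration; real token embeddings are produced by the model and are typically normalized.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings: the query has two tokens pointing in different directions
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.6, 0.0]]              # covers only the first
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

A single pooled vector for `doc_b` could still look similar to the query, whereas the per-token score makes the missing second term visible.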

Operations

Deploying, monitoring, and evolving retrieval systems in production — multilingual considerations, compression trade-offs, and model migration.

Multilingual Embedding Systems

Multilingual systems need representations that align related meaning across languages while preserving language-specific distinctions. Strong English performance does not guarantee strong cross-lingual retrieval — you must evaluate in each target language.

💡 A multilingual embedding is like a universal translator: it should place "heart attack" (English), "Herzinfarkt" (German), and "crise cardiaque" (French) near each other while keeping unrelated terms apart.

Same-Language Retrieval

Query in English retrieves English docs. Query in German retrieves German docs. Each language is a separate silo.

Simpler to evaluate; language-specific quality is paramount.

Cross-Language Retrieval

Query in English retrieves relevant docs in any language. Requires strong cross-lingual alignment in embedding space.

Harder; alignment quality varies across language pairs.

Key Considerations

Language coverage: Does the embedding model support all your target languages? Some models have strong coverage for European languages but weak coverage for low-resource languages (Thai, Swahili, Tagalog).
Script normalization: Different Unicode representations of the "same" character (e.g., full-width vs. half-width CJK) can produce different embeddings. Normalize before embedding.
Same-language vs. cross-language: These are different tasks with different failure modes. A model may excel at same-language French retrieval but poorly align French queries with English documents.
Evaluation per language: Build evaluation sets for each target language. A model reporting 95% recall in English may only achieve 70% recall in Korean.
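
For the script normalization point, Python's standard library covers the common case. This sketch applies Unicode NFKC normalization before embedding; the helper name `normalize_for_embedding` is an assumption, and whether NFKC is the right form for your corpus is worth checking, since it also folds some distinctions (such as ligatures and superscripts) that you may want to keep.

```python
import unicodedata


def normalize_for_embedding(text: str) -> str:
    """Apply NFKC so compatibility variants map to one canonical form."""
    return unicodedata.normalize("NFKC", text)


full_width = "ＡＢＣ１２３"   # full-width Latin letters and digits
half_width_kana = "ｱ"       # half-width katakana A
print(normalize_for_embedding(full_width))
print(normalize_for_embedding(half_width_kana))
```

Apply the same normalization to both documents at indexing time and queries at search time, otherwise the two sides can still disagree.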

Multilingual Embedding Models

Model	Languages	Strengths
multilingual-e5-large	100+	Strong cross-lingual retrieval; expects "query: " / "passage: " input prefixes
Cohere embed-multilingual	100+	Production API, good coverage
BGE-M3	100+	Multi-functionality (dense + sparse + multi-vector, ColBERT-style retrieval)
OpenAI text-embedding-3	Broad	Convenient API, adjustable dimensions

See Topic 2: Domain Adaptation for how to fine-tune multilingual models on domain-specific data in multiple languages.

→ Multilingual retrieval requires evaluation in each target language — never assume English performance transfers to other languages.

Python Example — Cross-Lingual Retrieval Test

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Test cross-lingual alignment:
# Query in English, documents in multiple languages
query = "query: What are the side effects of aspirin?"

docs = [
    "passage: Aspirin side effects include stomach bleeding.",  # EN
    "passage: Nebenwirkungen von Aspirin sind Magenblutungen.",  # DE
    "passage: Les effets secondaires de l'aspirine incluent...",  # FR
    "passage: The history of bicycle manufacturing.",  # EN irrelevant
]

# The "query: " / "passage: " prefixes in the strings above are the
# input format this E5 model expects
q_emb = model.encode([query], normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

# Cosine similarities
sims = (q_emb @ d_emb.T)[0]
for doc, sim in sorted(zip(docs, sims), key=lambda x: -x[1]):
    print(f"  {sim:.3f}  {doc[:60]}...")
# Expected: all three aspirin docs rank above the bicycle doc

Follow-up Questions

How do you handle mixed-language documents?

Documents that mix languages (e.g., English with Japanese product names) are challenging. Embed the document as-is rather than splitting by language, so the model captures the natural language mixing. Ensure your embedding model was trained on multilingual data that includes code-switching. Test retrieval quality specifically on mixed-language content.

Should you use separate indexes per language?

For same-language retrieval, separate indexes can improve precision by eliminating cross-language noise. For cross-language retrieval, a single unified index is necessary. A hybrid approach maintains per-language indexes but also a unified index for cross-language queries, routing based on detected query intent.

Compression & Quantization

Compression and quantization reduce memory and improve speed, but they can slightly distort vector distances. In many systems the trade-off is worthwhile, especially when the recall loss is small compared with the operational gain.

💡 Quantization is like reducing a high-resolution photo to a thumbnail. You lose some fine detail, but it loads 10x faster, and for most purposes you can still tell what is in the picture.

Memory and recall trade-offs at different precision levels

Precision	Storage per dimension	Recall impact
FP32	4 bytes	Baseline quality
FP16	2 bytes	Negligible loss
INT8	1 byte	~1% recall loss
Binary	1 bit	~5–10% recall loss

Engineering Economics

The interview-safe answer is to frame this as engineering economics. You test whether a cheaper representation still preserves enough of the ranking signal to meet product goals. For a 1-billion-vector index:

Precision	Memory per 768-dim vector	Total for 1B vectors
FP32	3,072 bytes	~3.07 TB (2.79 TiB)
FP16	1,536 bytes	~1.54 TB (1.40 TiB)
INT8	768 bytes	~768 GB (715 GiB)
Binary	96 bytes	~96 GB (89 GiB)
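
The arithmetic behind a table like this is simple enough to script, which makes it easy to rerun for your own vector count and dimensionality. The helper name `index_size_bytes` is an assumption, and the figures cover raw vector storage only, not index overhead.

```python
BITS_PER_DIM = {"FP32": 32, "FP16": 16, "INT8": 8, "Binary": 1}


def index_size_bytes(n_vectors: int, dim: int, precision: str) -> int:
    """Raw vector storage only; ignores index overhead such as graph links."""
    return n_vectors * dim * BITS_PER_DIM[precision] // 8


for precision in BITS_PER_DIM:
    size = index_size_bytes(1_000_000_000, 768, precision)
    # Report both decimal (TB) and binary (TiB) units to avoid mixing them
    print(f"{precision:>6}: {size / 1e12:.3f} TB ({size / 2**40:.2f} TiB)")
```

Reporting decimal and binary units side by side avoids the common mistake of comparing a TB figure against a GiB figure.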

Product Quantization (PQ)

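Product quantization divides each vector into subvectors and quantizes each subvector independently using a learned codebook. Each subvector is then stored as a short codebook index: a 768-dimensional FP32 vector (3,072 bytes) split into 48 subvectors with 8-bit codes takes 48 bytes, a 64x reduction. This reaches higher compression ratios (32x–64x) than INT8 scalar quantization (4x), and because the codebooks are learned from your data it often preserves ranking quality well at those ratios. It is a standard compression option in FAISS and is widely used in production vector databases.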

When to Compress

Always use FP16: The loss is negligible and the savings are free.
Use INT8 for large indexes: When memory is the bottleneck and 1% recall loss is acceptable.
Use binary for candidate generation: Fast approximate search to get top-1000, then re-rank with full-precision vectors.
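
The binary candidate-generation pattern in the last bullet can be sketched without any libraries. Sign-based binarization and Hamming distance are one common choice and are used here as an assumption; the function names and toy vectors are illustrative, and a real system would precompute the codes and use packed bit operations.

```python
from typing import List

Vector = List[float]


def to_binary(vec: Vector) -> int:
    """Pack the sign of each dimension into one Python int (1 bit per dim)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")


def two_stage_search(query: Vector, corpus: List[Vector],
                     n_candidates: int, top_k: int) -> List[int]:
    codes = [to_binary(v) for v in corpus]  # precomputed offline in practice
    q_code = to_binary(query)
    # Stage 1: cheap Hamming-distance shortlist over binary codes
    order = sorted(range(len(corpus)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = order[:n_candidates]

    # Stage 2: exact dot-product rerank, on the shortlist only
    def exact_score(i: int) -> float:
        return sum(x * y for x, y in zip(query, corpus[i]))

    return sorted(shortlist, key=exact_score, reverse=True)[:top_k]


corpus = [[0.9, 0.1, 0.2], [0.5, 0.6, 0.1], [-0.8, -0.3, 0.1], [0.2, -0.9, 0.4]]
query = [1.0, 0.2, 0.1]
print(two_stage_search(query, corpus, n_candidates=2, top_k=1))
```

Only the shortlist ever touches full-precision vectors, so those can live on slower, cheaper storage.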

See Topic 8: Similarity Thresholds for why thresholds must be recalibrated after changing precision.

→ Quantization is an engineering trade-off, not a quality question — test whether the cheaper representation preserves enough ranking signal for your product goals.

Python Example — FAISS Index with Product Quantization

import faiss
import numpy as np

dim = 768
n_vectors = 1_000_000

# Generate sample embeddings (replace with real embeddings)
vectors = np.random.randn(n_vectors, dim).astype("float32")
faiss.normalize_L2(vectors)  # normalize for cosine similarity

# Option 1: Flat index (exact, but uses full memory)
index_flat = faiss.IndexFlatIP(dim)
index_flat.add(vectors)
print(f"Flat: {index_flat.ntotal * dim * 4 / 1e9:.2f} GB")

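# Option 2: IVF + PQ (compressed, approximate)
n_lists = 1024   # number of Voronoi cells
m = 48            # number of subquantizers (must divide dim: 768 / 48 = 16)
n_bits = 8        # bits per subquantizer -> 48 bytes per vector
# IndexIVFPQ defaults to the L2 metric. On unit-length vectors L2 ranks
# results in the same order as inner product, so use an L2 coarse
# quantizer to keep the two consistent.
quantizer = faiss.IndexFlatL2(dim)
index_pq = faiss.IndexIVFPQ(quantizer, dim, n_lists, m, n_bits)
index_pq.train(vectors)
index_pq.add(vectors)
index_pq.nprobe = 32  # search 32 of 1024 cells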

# Compare recall: search both indexes for same queries
queries = np.random.randn(100, dim).astype("float32")
faiss.normalize_L2(queries)
_, exact = index_flat.search(queries, 10)
_, approx = index_pq.search(queries, 10)

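# Recall@10: fraction of true top-10 found by PQ
# (expect low values on random data, which has no cluster structure;
# measure on real embeddings before drawing conclusions)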
recall = np.mean([
    len(np.intersect1d(e, a)) / 10
    for e, a in zip(exact, approx)
])
print(f"PQ Recall@10: {recall:.1%}")

Follow-up Questions

What is Matryoshka embedding and how does it help compression?

Matryoshka Representation Learning trains embeddings so that the leading dimensions form a useful embedding on their own at several nested sizes. With a model trained this way, you can keep the first 256 of 768 dimensions, often with minimal quality loss, achieving 3x compression by truncating and re-normalizing. OpenAI's text-embedding-3 models support this via the dimensions parameter.

Can you combine quantization with dimensionality reduction?

Yes. A common pipeline is: Matryoshka truncation (768 -> 256 dims) followed by INT8 quantization (4 bytes -> 1 byte per dim). This gives ~12x total compression. You can even add PQ on top for extreme compression, though at that point recall testing is essential.
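
The compression arithmetic can be checked directly. The sketch below assumes symmetric per-dimension INT8 scaling, which is one common scheme among several; `quantize_int8` and `dequantize_int8` are illustrative names, and plain slicing stands in for Matryoshka truncation.

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Symmetric per-dimension INT8 quantization: returns codes and scales."""
    scales = np.maximum(np.abs(embeddings).max(axis=0) / 127.0, 1e-12)
    codes = np.clip(np.round(embeddings / scales), -127, 127).astype(np.int8)
    return codes, scales.astype("float32")

def dequantize_int8(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original float values."""
    return codes.astype("float32") * scales

rng = np.random.default_rng(0)
full = rng.standard_normal((1000, 768)).astype("float32")  # 4 bytes per dim
small = full[:, :256]                  # stand-in for Matryoshka truncation
codes, scales = quantize_int8(small)   # 1 byte per dim

# 768 * 4 = 3072 bytes per vector before, 256 * 1 = 256 bytes after
ratio = full.nbytes / codes.nbytes
print(f"compression: {ratio:.0f}x")
```

The per-dimension scales add a small fixed overhead (256 floats here) that is shared by the whole index, so it does not change the ratio meaningfully at corpus scale.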

Similarity Thresholds

Thresholds should be chosen from validation data, not intuition. The right threshold depends on the embedding model, the corpus, and what happens downstream when retrieval is too broad or too narrow.

💡 A similarity threshold is like a bouncer at a club. Too strict and you turn away good guests. Too loose and the room fills with people who should not be there. The right strictness depends on what is happening inside.

Threshold Too Low

Many irrelevant results pass through. LLM gets flooded with weak context.

Risk: hallucination from noise

Threshold Just Right

Relevant results pass; marginal results are filtered. LLM gets focused context.

Optimal: precision + recall balanced

Threshold Too High

Many relevant results rejected. LLM lacks sufficient context to answer.

Risk: "I don't have enough info"

Thresholds Are Pipeline Policy

Thresholds should be tuned jointly with reranking, answer generation, and abstention behavior. A threshold that maximizes offline recall may still hurt answer quality if it floods the model with weak context. The right threshold is the one that produces the best end-to-end answer quality, not the best retrieval metrics in isolation.

How to Set Thresholds

Collect validation data: Get query-document pairs with relevance labels (relevant / not relevant).
Compute similarity scores: Embed all queries and documents, compute cosine similarity for each pair.
Plot precision-recall curve: Vary the threshold and measure precision and recall at each point.
Choose based on downstream impact: A threshold that maximizes F1 may not be ideal if false positives (noise context) are much more harmful than false negatives (missing context).

What Invalidates a Threshold

Change	Why Threshold Needs Recalibration
New embedding model	Different models produce different similarity distributions
Corpus changes	Adding documents shifts the similarity landscape
Query pattern shifts	Users asking different types of questions
Quantization/compression	Compression distorts distances, changing effective thresholds

See Topic 9: Monitoring Retrieval Drift for how to detect when thresholds need adjustment, and Topic 7: Compression & Quantization for compression-induced threshold shifts.

→ Thresholds are part of the pipeline policy — tune them on validation data jointly with reranking, generation, and abstention behavior, and recalibrate after any component changes.

Python Example — Finding the Optimal Threshold

import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(
    similarities: np.ndarray,
    labels: np.ndarray,
    beta: float = 1.0
) -> dict:
    """Find the threshold that maximizes F-beta score.

    Args:
        similarities: cosine similarities for each (query, doc) pair
        labels: 1 = relevant, 0 = not relevant
        beta: F-beta weight (beta > 1 favors recall,
              beta < 1 favors precision)
    """
    precisions, recalls, thresholds = precision_recall_curve(
        labels, similarities
    )

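    # precision_recall_curve returns one more precision/recall value than
    # thresholds (a final point at recall 0 with no threshold), so drop it
    # to keep all three arrays aligned
    precisions, recalls = precisions[:-1], recalls[:-1]

    # Compute F-beta at each threshold
    f_scores = (
        (1 + beta**2) * precisions * recalls /
        (beta**2 * precisions + recalls + 1e-8)
    )

    best_idx = int(np.argmax(f_scores))
    return {
        "threshold": float(thresholds[best_idx]),
        "precision": float(precisions[best_idx]),
        "recall": float(recalls[best_idx]),
        "f_score": float(f_scores[best_idx]),
    }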

Follow-up Questions

Should you use a fixed threshold or a dynamic one?

For most systems, a fixed threshold per model/corpus is simpler and more predictable. Dynamic thresholds (e.g., "take any result within 90% of the best match") can handle score variation better but add complexity. Some teams use a fixed minimum threshold combined with a dynamic "relative-to-best" filter for robustness.
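
A combined filter of this kind might look like the sketch below. The function name and the default values (0.5 floor, 0.9 relative factor) are illustrative assumptions, not recommendations; both should come from validation data. The relative cut assumes non-negative similarity scores.

```python
from typing import List, Tuple

def filter_results(
    scored: List[Tuple[str, float]],
    min_score: float = 0.5,
    relative: float = 0.9,
) -> List[Tuple[str, float]]:
    """Keep results above a fixed floor AND within `relative` of the best."""
    if not scored:
        return []
    best = max(score for _, score in scored)
    # The stricter of the two cutoffs wins
    cutoff = max(min_score, relative * best)
    return [(doc_id, score) for doc_id, score in scored if score >= cutoff]

results = [("a", 0.82), ("b", 0.78), ("c", 0.61), ("d", 0.40)]
kept = filter_results(results)
print(kept)  # "c" passes the fixed floor but not the relative cut
```

The fixed floor protects against queries where even the best match is weak; the relative cut trims the tail when the best match is strong.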

What happens when retrieval returns zero results above the threshold?

Design for this case explicitly. Options include: falling back to keyword search, lowering the threshold temporarily, generating a response acknowledging insufficient context, or asking the user to rephrase. The worst option is silently returning nothing and leaving the LLM to generate an answer without any retrieved context.
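
One way to make the empty-result path explicit is a small fallback chain. This is a sketch under assumptions: `vector_search` and `keyword_search` are hypothetical callables supplied by the caller, results are dicts with a `"score"` key for vector hits, and the `"action"` label is a placeholder for whatever abstention behavior the application defines.

```python
from typing import Callable, Dict

def retrieve_with_fallback(
    query: str,
    vector_search: Callable,
    keyword_search: Callable,
    threshold: float = 0.5,
) -> Dict:
    """Try vector search first, then keyword search, then abstain."""
    hits = [r for r in vector_search(query) if r["score"] >= threshold]
    if hits:
        return {"source": "vector", "results": hits}

    keyword_hits = keyword_search(query)
    if keyword_hits:
        return {"source": "keyword", "results": keyword_hits}

    # Nothing usable: tell the caller explicitly instead of returning silence
    return {"source": "none", "results": [],
            "action": "abstain_or_ask_rephrase"}

# Hypothetical stand-ins for real search backends
def fake_vector_search(query):
    return [{"id": "d1", "score": 0.31}]

def fake_keyword_search(query):
    return [{"id": "d7"}]

outcome = retrieve_with_fallback(
    "reset vpn token", fake_vector_search, fake_keyword_search
)
print(outcome["source"])
```

Returning the source alongside the results lets the generator adjust its prompt, for example by hedging more when the context came from the keyword fallback.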

Monitoring Retrieval Drift

Monitor query distributions, nearest-neighbor patterns, recall on canary sets, click/acceptance behavior, and the rate of irrelevant contexts reaching the generator. Retrieval systems need ongoing observation because their environment changes even when the model does not.

💡 A retrieval system is like a map. The map does not change, but the territory does: new roads, new buildings, closed paths. Unless you regularly check the map against reality, you will send people to the wrong places.

Drift signals to monitor in production retrieval systems

Query Distribution

Are users asking different types of questions than when the system was built?

Nearest-Neighbor Patterns

Are average similarity scores for top-k results declining?

Canary Set Recall

Do known good query-document pairs still retrieve correctly?

Click/Acceptance Rate

Are users clicking on, using, or accepting retrieved results less often?

Irrelevant Context Rate

How often does the LLM receive and try to use irrelevant retrieved context?

New Term Frequency

Are queries using terms that did not exist when the model was trained?

Sources of Drift

Drift can come from changes in user language, data ingestion, new product terms, or updated business processes. The embedding model remains frozen, but the world it represents keeps changing:

User language shifts: New jargon, trending terms, or changes in how users phrase queries.
Corpus evolution: New documents added, old documents removed or updated. The embedding index may not reflect current reality.
Business process changes: A reorganization or product rename means old queries no longer map to the right documents.
Seasonal patterns: Query patterns shift with business cycles (tax season, product launches, regulatory deadlines).

Building a Monitoring Pipeline

Component	What to Track	Alert When
Canary queries	Recall on a fixed set of known-good pairs	Recall drops below baseline
Score distribution	Mean/median/p95 of top-k similarity scores	Distribution shifts significantly
User feedback	Thumbs up/down, click-through, answer acceptance	Satisfaction trends downward
LLM abstention rate	How often the LLM says "I don't have enough info"	Rate increases (may indicate retrieval gaps)
Empty result rate	Queries returning zero results above threshold	Rate increases
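
The score-distribution row can be sketched as a comparison between a stored baseline window and a recent window of top-k similarity scores. This is a deliberately simple check, assuming a fixed tolerance on the drop in mean score; `score_shift` and the 0.05 default are illustrative, and a statistical test may be more appropriate for noisy traffic.

```python
import numpy as np

def score_shift(baseline, current, max_mean_drop: float = 0.05) -> dict:
    """Compare summary statistics of two windows of top-k scores."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    summaries = {
        "mean": np.mean,
        "median": np.median,
        "p95": lambda x: np.percentile(x, 95),
    }
    stats = {
        name: {"baseline": float(fn(baseline)), "current": float(fn(current))}
        for name, fn in summaries.items()
    }
    # Alert only on a drop in the mean; other rules can be added the same way
    drop = stats["mean"]["baseline"] - stats["mean"]["current"]
    stats["alert"] = bool(drop > max_mean_drop)
    return stats

# Synthetic windows for illustration
report = score_shift([0.8] * 100, [0.7] * 100)
print(report["alert"])
```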

See Topic 8: Similarity Thresholds for when to recalibrate thresholds as drift is detected.

→ Representation quality is not a one-time achievement — retrieval systems need ongoing observation because their environment changes even when the model does not.

Python Example — Canary Set Monitoring

import time
from typing import Callable, Dict, List

def run_canary_check(
    search_fn: Callable,
    canary_set: List[Dict],
    k: int = 10,
    min_recall: float = 0.9,
) -> Dict:
    """Check retrieval quality against known-good pairs.

    Args:
        search_fn: function(query, top_k) -> list of dicts with an "id" key
        canary_set: [{"query": str, "expected_doc_id": str}, ...]
        k: number of results to check
        min_recall: alert when recall@k falls below this baseline
    """
    if not canary_set:
        raise ValueError("canary_set must contain at least one pair")

    hits = 0
    scores = []    # reciprocal rank per canary (0.0 on a miss)
    failures = []  # queries whose expected document was not in the top k

    for canary in canary_set:
        results = search_fn(canary["query"], top_k=k)
        result_ids = [r["id"] for r in results]

        if canary["expected_doc_id"] in result_ids:
            hits += 1
            rank = result_ids.index(canary["expected_doc_id"]) + 1
            scores.append(1.0 / rank)  # reciprocal rank
        else:
            failures.append(canary["query"])
            scores.append(0.0)

    result = {
        "timestamp": time.time(),
        "recall_at_k": hits / len(canary_set),
        "mrr": sum(scores) / len(scores),
        "failures": failures,
    }

    # Alert if recall drops below the configured baseline
    if result["recall_at_k"] < min_recall:
        print(f"ALERT: Canary recall dropped to "
              f"{result['recall_at_k']:.1%}")
    return result

Follow-up Questions

How often should you run canary checks?

Run canary checks at least daily for production systems and after every corpus update. Some teams run them hourly or as part of their CI/CD pipeline. The checks are cheap (a few dozen queries against the index) and provide early warning of degradation before users notice.

How do you distinguish drift from a bug?

Drift is gradual and affects many queries. A bug is sudden and may affect specific query patterns. If canary recall drops sharply after a deployment, suspect a bug (wrong model loaded, index corruption, config change). If it drifts slowly over weeks, suspect genuine query or corpus drift. Check deployment logs alongside monitoring data.

When should drift trigger re-training vs. re-indexing?

Re-indexing (re-embedding new/changed documents) is needed when the corpus changes. Re-training the embedding model is needed when the types of queries or the domain language itself have shifted. Re-indexing is cheaper and more common. Re-training is a bigger investment, not least because a re-trained model produces a new vector space and the entire corpus must then be re-embedded, so it should be driven by clear benchmark evidence.

Embedding Model Migration

Model migration usually requires re-embedding the corpus, validating new retrieval behavior, and potentially recalibrating thresholds and rerankers. During the transition, many teams dual-run both indexes so they can compare results and de-risk rollout.

💡 Changing the embedding model is not like swapping a battery. It is like changing the coordinate system on a map — every point must be recalculated, and the scale may have shifted.

Migration phases — think operationally, not just technically

Benchmark on Current System

Measure baseline retrieval quality with the existing model. This is your comparison point.

Re-Embed the Corpus

Embed all documents with the new model. For large corpora, this can take hours to days and significant compute.

Validate Retrieval Quality

Run the full benchmark suite against the new index. Compare recall, precision, MRR, and end-to-end answer quality.

Recalibrate Thresholds & Rerankers

Similarity distributions differ between models. Old thresholds will not work correctly.

Dual-Run & Gradual Rollout

Run both old and new indexes in parallel. Route a percentage of traffic to the new index and compare results before full cutover.

Not Just Swapping an API Call

Changing the embedding model is not just swapping one API call. It is a data migration and quality management exercise. Vectors from different models live in incompatible spaces — you cannot search a new-model query against an old-model index. Every document must be re-embedded.

Migration Checklist

Step	Details	Common Pitfall
Baseline benchmark	Measure current recall, MRR, NDCG on eval set	Migrating without a baseline to compare against
Corpus re-embedding	Embed all documents; budget compute and time	Underestimating re-embedding time for large corpora
Threshold recalibration	Find new optimal thresholds on validation data	Reusing old thresholds (different score distributions)
Reranker compatibility	Test if existing reranker works with new embeddings	Assuming reranker is model-agnostic when it is not
A/B testing	Route 5–10% of traffic to new index, measure user impact	Big-bang cutover without shadow comparison
Rollback plan	Keep old index available for quick revert	Deleting old index before verifying new one

Dual-Run Strategy

The safest migration pattern is to dual-run both indexes during transition:

Query both indexes for every request.
Serve results from the old index (no user impact).
Log and compare results from both indexes offline.
When the new index consistently matches or exceeds the old one, switch traffic gradually.
Keep the old index available for rollback for 1–2 weeks after full cutover.
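
The first three steps can be sketched as a shadow wrapper. This is a minimal sketch, assuming both search functions accept `(query, top_k)` and return dicts with an `"id"` key; `shadow_search` and the in-memory `log` list are illustrative, and a real system would usually query the new index asynchronously and write to durable storage.

```python
from typing import Callable, Dict, List

def shadow_search(
    query: str,
    old_search: Callable,
    new_search: Callable,
    log: List[Dict],
    k: int = 10,
) -> List[Dict]:
    """Serve old-index results; record new-index results for offline review."""
    old_results = old_search(query, top_k=k)
    try:
        new_results = new_search(query, top_k=k)
        log.append({
            "query": query,
            "old_ids": [r["id"] for r in old_results],
            "new_ids": [r["id"] for r in new_results],
        })
    except Exception as exc:
        # A failure in the shadow path must never affect the user
        log.append({"query": query, "error": repr(exc)})
    return old_results

# Hypothetical stand-ins for the two indexes
def old_index(query, top_k=10):
    return [{"id": "a"}, {"id": "b"}]

def new_index(query, top_k=10):
    return [{"id": "b"}, {"id": "c"}]

def broken_index(query, top_k=10):
    raise RuntimeError("index unavailable")

log: List[Dict] = []
served = shadow_search("q1", old_index, new_index, log)
served_again = shadow_search("q2", old_index, broken_index, log)
print(len(log))
```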

See Topic 8: Similarity Thresholds and Topic 9: Monitoring Retrieval Drift for the calibration and monitoring that must accompany migration.

→ Embedding model migration is a data migration and quality management exercise — plan for re-embedding, recalibration, dual-running, and rollback.

Python Example — Dual-Index Comparison

from typing import Callable, Dict

def compare_indexes(
    query: str,
    old_search: Callable,
    new_search: Callable,
    k: int = 10,
) -> Dict:
    """Compare retrieval results from old and new embedding indexes.

    Returns overlap metrics and both result sets for analysis.
    """
    old_results = old_search(query, top_k=k)
    new_results = new_search(query, top_k=k)

    old_ids = [r["id"] for r in old_results]
    new_ids = [r["id"] for r in new_results]

    # Measure overlap between result sets
    overlap = len(set(old_ids) & set(new_ids))
    union = len(set(old_ids) | set(new_ids))

    # Jaccard similarity of the two top-k sets. This is order-insensitive:
    # it measures set agreement, not rank correlation. Two empty result
    # sets are treated as full agreement.
    jaccard = overlap / union if union else 1.0

    # Top-1 agreement
    top1_match = bool(old_ids and new_ids and old_ids[0] == new_ids[0])

    return {
        "query": query,
        "overlap_at_k": overlap / k,
        "jaccard": jaccard,
        "top1_match": top1_match,
        "old_top3": old_ids[:3],
        "new_top3": new_ids[:3],
    }

# Run over benchmark queries and log for analysis
# Look for patterns in disagreements to understand
# where the new model differs from the old one

Follow-up Questions

How long does re-embedding a large corpus typically take?

For a corpus of 1 million documents at ~500 tokens each (about 500 million tokens in total), re-embedding takes roughly 2–8 hours on a single GPU, or 30–60 minutes with an API, depending on model size, batch size, and rate limits. For 100M+ documents, plan for days of compute or significant API cost. Budget this into your migration timeline and consider parallelizing across multiple GPUs or API keys.
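
A back-of-the-envelope estimate helps with budgeting. The throughput figure below is a placeholder assumption, not a benchmark; measure tokens per second on a sample of your own corpus with your own model and hardware before relying on the result.

```python
def estimate_embedding_hours(
    n_docs: int,
    tokens_per_doc: int,
    tokens_per_second: float,
    n_workers: int = 1,
) -> float:
    """Estimate wall-clock hours to embed a corpus at a measured throughput."""
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / (tokens_per_second * n_workers) / 3600

# 1M documents at ~500 tokens each, assuming 50,000 tokens/s per worker
single = estimate_embedding_hours(1_000_000, 500, tokens_per_second=50_000)
parallel = estimate_embedding_hours(1_000_000, 500, 50_000, n_workers=4)
print(f"1 worker: {single:.1f} h, 4 workers: {parallel:.1f} h")
```

The estimate assumes perfect scaling across workers; in practice, I/O, batching overhead, and rate limits reduce the speedup.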

What if the new model is a different dimensionality?

Different dimensionalities require a new index entirely: you cannot mix 768-dim and 1536-dim vectors in the same FAISS index. Some models (OpenAI text-embedding-3) support Matryoshka truncation to match the old dimensionality, but this should be tested for quality impact, and matching the dimension does not make the old and new vectors comparable. Plan for a full re-embedding and index rebuild either way.

How do you handle migration for real-time systems with zero downtime?

The dual-index pattern handles this naturally. Build the new index offline while the old index serves production traffic. Once the new index is ready and validated, add it as a secondary search target. Route traffic gradually (5% -> 25% -> 100%). The old index stays live until you are confident in the new one. This requires sufficient infrastructure to run both indexes simultaneously.
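
Gradual routing is often done with a deterministic hash so that the same request or user key always lands on the same side. The sketch below is one way to do it, assuming a string key per request; `route_to_new_index` is an illustrative name.

```python
import hashlib

def route_to_new_index(request_key: str, rollout_percent: float) -> bool:
    """Deterministically assign a key to the new index for a given rollout %."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rollout_percent * 100

keys = [f"user-{i}" for i in range(10_000)]
for percent in (5, 25, 100):
    share = sum(route_to_new_index(k, percent) for k in keys) / len(keys)
    print(f"{percent}% rollout -> {share:.1%} of keys on new index")
```

Because each key maps to a fixed bucket, raising the percentage only adds keys to the new index and never moves one back, which keeps the experience stable during the ramp and makes rollback a single configuration change.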