Ch 3: Embeddings & Semantic Representations

Foundations

What embeddings are, how they enable similarity-based retrieval, and the metrics that make comparison possible.

What Is an Embedding?

An embedding is a dense numerical vector that represents a token, sentence, or document in a way that preserves semantic relationships. Distance in embedding space encodes similarity in meaning.

💡 An embedding is a GPS coordinate for meaning — similar ideas get nearby coordinates, distant ideas land far apart.

Raw Text → Embedding Model → Vector [0.31, 0.22, ...] → Similarity Search

The embedding pipeline: text becomes geometry, and geometry enables retrieval.

From Words to Vectors

An embedding maps a discrete object — a word, a sentence, an image — into a continuous vector space where geometric relationships encode semantic ones. Instead of treating text as an ID lookup, the model projects it into a space where distance and direction carry meaning.

This is what makes embeddings central to modern AI engineering. They let machines compare meaning without relying only on exact keyword matches. A query about "physician salary growth" can retrieve content about "doctor compensation trends" because the vectors are close, even though no words overlap.

How Embeddings Are Learned

Embedding vectors are not hand-crafted. They are learned during model training by optimizing an objective, typically one that pushes related items closer together and unrelated items apart. The training objective determines what "similarity" means:

Contrastive learning: Pairs of related texts are pushed together; random negatives are pushed apart.
Masked language modeling: Predicting missing tokens forces the model to encode contextual meaning.
CLIP-style training: Images and their captions are aligned in a shared space, enabling cross-modal retrieval.
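
To make the contrastive objective concrete, here is a minimal NumPy sketch of an in-batch contrastive (InfoNCE-style) loss. The function name, the temperature value, and the random vectors are illustrative assumptions, not the recipe of any particular model; real training would backpropagate this loss through an encoder.

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch contrastive loss: row i of `queries` should match row i of `positives`."""
    # L2-normalize so the dot product equals cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature              # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds each query's true pair; other columns act as negatives
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = anchors + 0.01 * rng.normal(size=(4, 8))  # positives close to their anchors
shuffled = aligned[::-1].copy()                      # positives paired with the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print("aligned pairs loss: ", round(loss_aligned, 4))
print("shuffled pairs loss:", round(loss_shuffled, 4))
```

The loss is low when each query is closest to its own positive and high when the pairing is wrong, which is the pressure that shapes the embedding space during training.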

Why This Matters in Interviews

A strong interview answer explains both levels: embeddings are learned numerical representations, and their value comes from turning semantic comparison into vector operations. Avoid treating all embedding models as interchangeable — mention task mismatch, domain drift, and the difference between retrieval quality and generation quality. See Topic 3: Embedding Levels for how scope affects which embedding to use.

→ Embeddings turn meaning into measurable geometry — they are not just "fancy vectors from an API" but learned representations optimized for specific tasks.

Python Example

from math import sqrt

# A minimal cosine similarity function
# Works on any two equal-length vectors
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Simulate two document embeddings (3-dimensional for clarity)
query  = [0.30, 0.22, 0.91]
doc_a  = [0.28, 0.20, 0.89]  # semantically similar
doc_b  = [0.10, 0.77, 0.14]  # semantically different

print("query vs doc_a:", round(cosine(query, doc_a), 4))
print("query vs doc_b:", round(cosine(query, doc_b), 4))

Follow-up Questions

How many dimensions do production embeddings typically have?

Modern embedding models produce vectors with 256 to 3072 dimensions. OpenAI's text-embedding-3-large uses 3072, while smaller models like all-MiniLM use 384. Higher dimensions capture more nuance but cost more to store and search. See Topic 9: Dimension Trade-offs for the full trade-off analysis.

Can embeddings capture negation and other subtle meanings?

Partially. Simple embedding models struggle with negation ("the food was good" vs. "the food was not good" can produce similar vectors). More sophisticated models trained with hard negatives handle this better, but no embedding model is perfect at negation. This is one reason cross-encoders are used for reranking.

Are embeddings the same as a model's internal hidden states?

No. Internal hidden states change at every layer and serve the model's own processing. Embedding vectors for retrieval are typically produced by a dedicated embedding model or by pooling the final hidden states of an encoder. They are designed for external comparison, not internal computation.

Why Embeddings Enable Semantic Search

Semantic search works because relevant items do not need to share the exact same words — if their embeddings land close together in vector space, they are retrieved as matches.

💡 Keyword search is like looking up a word in a dictionary index. Semantic search is like asking a librarian who understands what you mean.

Beyond Keyword Matching

Traditional keyword search, typically ranked with BM25, relies on term frequency and exact keyword overlap. If a user searches for "physician salary growth" but the document says "doctor compensation trends," keyword search misses it entirely. Semantic search bridges this gap because both phrases produce similar embedding vectors.

This makes embeddings especially useful for question answering, support search, and long-tail user phrasing where users rarely phrase queries the same way the content was written.

The Caveat: False Proximity

Dense similarity can also retrieve conceptually adjacent but wrong content. A query about "bank interest rates" might retrieve documents about "river banks" if the embedding model is not domain-tuned. This is why production systems often add:

Metadata filters — restrict results by category, date, or source.
Lexical matching — combine dense retrieval with BM25 for hybrid search. See Topic 7: Dense vs Sparse.
Reranking — use a cross-encoder to re-score the shortlist. See Topic 8: Bi-Encoder vs Cross-Encoder.
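
As a rough illustration of the first safeguard, the sketch below applies a metadata filter before ranking by cosine similarity. The records, category labels, and vectors are made up for the example; a production vector database would normally apply the filter inside the index.

```python
import numpy as np

# Hypothetical mini-corpus: each record has an embedding and a category tag
records = [
    {"id": "fin-1", "category": "finance",   "vec": np.array([0.9, 0.1, 0.2])},
    {"id": "geo-1", "category": "geography", "vec": np.array([0.88, 0.15, 0.25])},
    {"id": "fin-2", "category": "finance",   "vec": np.array([0.2, 0.9, 0.1])},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filtered_search(query_vec, records, category, top_k=2):
    """Apply the metadata filter first, then rank survivors by cosine similarity."""
    candidates = [r for r in records if r["category"] == category]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in ranked[:top_k]]

query = np.array([0.9, 0.12, 0.22])  # stands in for "bank interest rates"
print(filtered_search(query, records, category="finance"))
```

The geography record sits very close to the query in vector space, yet it never reaches the result list because the filter removes it before scoring.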

Interview Signal

The strongest answer treats embedding search as a probabilistic relevance stage, not a guaranteed truth mechanism. Mention that dense retrieval tends to deliver good recall, while precision usually depends on downstream filtering and reranking.

→ Semantic search is powerful because it matches meaning, not words — but it is not automatically precise, which is why production pipelines add filters, reranking, and hybrid retrieval.

Python Example

import numpy as np

# Simulate embeddings for a query and candidate documents
query_vec = np.array([0.8, 0.1, 0.6])

docs = {
    "doctor compensation trends":   np.array([0.78, 0.12, 0.58]),
    "nurse hiring pipeline":        np.array([0.55, 0.30, 0.45]),
    "river bank erosion rates":     np.array([0.10, 0.85, 0.05]),
}

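# Score each document by cosine similarity, then rank highest first
scores = {
    title: np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
    for title, vec in docs.items()
}
for title, sim in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{sim:.3f}  {title}")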

Follow-up Questions

How do you handle queries with very different phrasing from the corpus?

Use embedding models explicitly trained for asymmetric retrieval (short query vs. long document). Models like E5 and BGE are trained with query-document pairs rather than symmetric sentence pairs, which significantly improves retrieval when phrasing differs.

What is hybrid search and when should you use it?

Hybrid search combines dense (embedding) retrieval with sparse (BM25/keyword) retrieval and merges the result lists. Use it when your corpus contains exact terms that must not be missed (product codes, legal citations, medical terms) alongside natural language content.

Does the embedding model need to be fine-tuned for every domain?

Not always. General-purpose models work well for broad content, but domains like legal, medical, or code search often benefit substantially from fine-tuning on in-domain data. Treat the commonly cited 10-30% recall improvement for specialized corpora as a rough guide rather than a guarantee, and measure the gain on your own test set.

Token, Sentence, and Document Embeddings

Token embeddings represent individual tokens at the input layer. Sentence embeddings compress a whole sentence into one vector for comparison. Document embeddings do the same at larger scope. Match the representation level to the task.

💡 Token embeddings are individual puzzle pieces. Sentence embeddings are the assembled picture. Document embeddings are the whole puzzle box.

Diagram: Token Embeddings, Sentence Embedding, Document Embedding.

Three Levels of Representation

Level	Scope	Typical Use	Training Signal
Token	Single subword	Internal model processing	Language modeling objective
Sentence	One sentence or chunk	Semantic search, clustering	Contrastive pairs, NLI
Document	Full document or page	Document retrieval, classification	Pooling or chunk aggregation

Why This Distinction Matters

Token embeddings are part of the model's internal processing and are not usually used directly for search. They represent individual token identities, not semantic meaning at the sentence level. For search and clustering, you want sentence or chunk embeddings explicitly trained to preserve semantic similarity at that level.

Sentence-BERT (Reimers & Gurevych, 2019) made sentence-level similarity search dramatically more efficient than pairwise cross-encoding by producing fixed-size vectors that can be precomputed and indexed.
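
The "precompute and index" idea can be sketched with plain NumPy: document vectors are normalized once, and each query is then scored against all of them with a single matrix-vector product. The helper names and the 3-dimensional vectors are assumptions for illustration; at scale an ANN index would replace the brute-force scan.

```python
import numpy as np

def build_index(doc_vectors):
    """Normalize document vectors once, ahead of query time."""
    doc_vectors = np.asarray(doc_vectors, dtype=np.float32)
    return doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def search(index, query_vec, top_k=2):
    """Score every stored vector with a single matrix-vector product."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Made-up 3-d vectors standing in for sentence embeddings
index = build_index([[0.78, 0.12, 0.58], [0.55, 0.30, 0.45], [0.10, 0.85, 0.05]])
hits = search(index, [0.8, 0.1, 0.6])
print(hits)
```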

Choosing the Right Level

Retrieval-augmented generation: Sentence or chunk embeddings (typically 256-512 tokens per chunk).
Document classification: Document-level embeddings or pooled chunk embeddings.
Named entity disambiguation: Token-level contextual embeddings from a fine-tuned encoder.

→ Match the embedding level to the task. Token embeddings power internal model mechanics; sentence and document embeddings power retrieval and comparison.

Python Example

# Sentence embedding via mean pooling (simplified)
import numpy as np

# Simulated token embeddings for "The cat sat"
token_embs = np.array([
    [0.1, 0.5, 0.3],  # "The"
    [0.8, 0.2, 0.9],  # "cat"
    [0.4, 0.6, 0.1],  # "sat"
])

# Mean pooling: average all token vectors into one
sentence_emb = token_embs.mean(axis=0)
print("Sentence embedding:", sentence_emb)

# Normalize to unit length for cosine similarity
sentence_emb = sentence_emb / np.linalg.norm(sentence_emb)
print("Normalized:", sentence_emb)

Follow-up Questions

What is mean pooling vs. CLS pooling?

Mean pooling averages all token embeddings into one vector. CLS pooling uses the special [CLS] token's embedding as the sentence representation. Mean pooling generally outperforms CLS pooling for retrieval tasks because it captures information from all tokens, not just one.
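
A small sketch of both pooling strategies follows, using simulated token vectors. The attention mask and the padding row are assumptions added to show why mean pooling should skip padded positions.

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(token_embs.dtype)
    return (token_embs * mask).sum(axis=0) / mask.sum()

def cls_pool(token_embs):
    """Use the first position ([CLS]) as the sentence vector."""
    return token_embs[0]

# Simulated encoder output: [CLS], "The", "cat", "sat", [PAD]
token_embs = np.array([
    [0.2, 0.4, 0.6],  # [CLS]
    [0.1, 0.5, 0.3],  # "The"
    [0.8, 0.2, 0.9],  # "cat"
    [0.4, 0.6, 0.1],  # "sat"
    [9.0, 9.0, 9.0],  # [PAD], must not influence the result
])
attention_mask = np.array([1, 1, 1, 1, 0])

print("mean pooled:", mean_pool(token_embs, attention_mask))
print("CLS pooled: ", cls_pool(token_embs))
```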

How do you embed documents longer than the model's context window?

Split the document into overlapping chunks (e.g., 512 tokens with 50-token overlap), embed each chunk separately, then either store all chunk vectors or aggregate them via averaging. Late chunking approaches process the full document first, then pool representations per chunk for better coherence.
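
The overlapping-chunk step might look like the sketch below. The function name and the integer token IDs are placeholders; in practice the tokens would come from the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for token IDs from a tokenizer
chunks = chunk_tokens(tokens)
print([(c[0], c[-1], len(c)) for c in chunks])
```

Each chunk would then be embedded separately, and the resulting vectors either stored individually or averaged into one document vector.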

Can multimodal embeddings mix text and images in the same space?

Yes. Models like CLIP (Radford et al., 2021) project both text and images into a shared embedding space, enabling cross-modal retrieval: search images with text queries or find similar images to a caption. The key is that the training objective aligns paired modalities.

Why Engineers L2-Normalize Embeddings

Normalization scales vectors to unit length so similarity depends on direction rather than magnitude. This makes cosine similarity and dot product behave consistently and improves retrieval stability.

💡 Normalization is like adjusting all speakers to the same volume — you compare what they say, not how loud they are.

What Normalization Does

L2 normalization divides each vector by its Euclidean norm (length), producing a unit vector that lies on the surface of a hypersphere. After normalization, every vector has length 1.0, so cosine_similarity(a, b) = dot_product(a, b).

Without normalization, one vector can dominate comparisons because of size rather than meaning. A document embedding with large activations would score higher in dot-product search even if a smaller-norm vector is semantically closer.

When to Normalize

Always normalize when the embedding model documentation recommends it (most do).
Always normalize when you want cosine behavior from a vector index that computes dot product (FAISS, Pinecone inner product mode).
Skip normalization only when magnitude carries meaningful signal (e.g., some models encode confidence in vector length).
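
When normalization is the right call, it is usually applied to a whole batch at once. The helper below is a minimal sketch; the name `l2_normalize` and the epsilon guard for all-zero rows are assumptions, not part of any specific library.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

batch = np.array([[3.0, 4.0], [0.6, 0.8], [0.0, 0.0]])
unit = l2_normalize(batch)
print(unit)
print("row norms:", np.linalg.norm(unit, axis=1))
```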

Practical Impact

Normalization simplifies index behavior and thresholding. With unit vectors, a dot-product score of 0.9 is a cosine of 0.9, so it means the same angular closeness regardless of which documents are compared. Without normalization, dot-product thresholds become unreliable because scores depend on both direction and magnitude. See Topic 5: Cosine vs Dot Product for how normalization unifies these two metrics.

→ Normalization is not magic; it is a design choice that makes similarity scores comparable and thresholds reliable.

Python Example

import numpy as np

# Two vectors with different magnitudes
a = np.array([3.0, 4.0])
b = np.array([0.6, 0.8])

# Before normalization: same direction, different norms
print("Norm a:", np.linalg.norm(a))   # 5.0
print("Norm b:", np.linalg.norm(b))   # 1.0
print("Dot product:", np.dot(a, b))  # 5.0 (misleading!)

# After normalization: both unit length
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print("Dot (normalized):", np.dot(a_n, b_n))  # 1.0 (identical direction)

Follow-up Questions

Does normalization lose information?

Yes, it discards magnitude information. If vector length encodes meaningful signal (e.g., document importance or model confidence), normalization removes it. For most retrieval tasks, direction is what matters, so the trade-off is worthwhile.

Should I normalize before or after storing in the vector database?

Normalize before inserting into the index, and normalize queries at search time too. Most vector databases offer a cosine similarity mode that normalizes internally, but pre-normalizing and using inner product mode is slightly faster and gives you explicit control.

What happens if I mix normalized and unnormalized vectors?

Comparisons become unreliable. An unnormalized vector with high magnitude will dominate dot-product rankings regardless of semantic relevance. Always ensure consistency: either all vectors are normalized or none are, and use the matching similarity metric.

Cosine Similarity vs Dot Product

Cosine similarity measures the angle between vectors (direction only). Dot product mixes angle and magnitude. Use cosine when norms vary; after L2 normalization, they are equivalent.

💡 Cosine asks "are they pointing the same way?" Dot product asks "are they pointing the same way AND how strongly?"

Diagram: a Query vector with Doc A (similar direction) and Doc B (different direction).

When Cosine Is Safer

Use cosine similarity when:

Embedding norms vary across examples (different document lengths, mixed domains).
The model documentation recommends cosine-based retrieval.
You want ranking behavior that is less sensitive to scale differences from training or preprocessing.

When They Are Equivalent

If embeddings are L2-normalized (see Topic 4), cosine similarity and dot product produce identical rankings because every vector has unit length. In production, the real rule is consistency: use the metric your embedding model, vector index, and offline evaluation pipeline are designed around.

The Mismatch Trap

A metric mismatch can quietly change retrieval quality even though the embeddings themselves never changed. If the model was trained with cosine similarity but the index searches unnormalized vectors with dot product (or vice versa), ranking order can shift. This is a common source of "nothing changed but recall dropped" bugs in production systems.
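
The toy example below shows how a ranking can flip when only the metric changes. The two document vectors are contrived for the demonstration: one is well aligned with the query but short, the other is less aligned but has a large norm.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = {
    "A (aligned, small norm)":  np.array([0.9, 0.1]),
    "B (off-axis, large norm)": np.array([3.0, 3.0]),
}

def rank(score_fn):
    """Return document names sorted from best to worst under score_fn."""
    return sorted(docs, key=lambda name: score_fn(query, docs[name]), reverse=True)

def dot(a, b):
    return float(np.dot(a, b))

def cosine(a, b):
    return dot(a, b) / float(np.linalg.norm(a) * np.linalg.norm(b))

print("dot-product ranking:", rank(dot))
print("cosine ranking:     ", rank(cosine))
```

The same stored vectors give a different top result under each metric, which is exactly the kind of silent regression the mismatch trap describes.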

→ After normalization, cosine and dot product are equivalent. The real rule is consistency across training, indexing, and evaluation.

Python Example

import numpy as np

a = np.array([0.8, 0.6])
b = np.array([0.3, 0.9])

# Cosine similarity: angle-only comparison
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Cosine:", round(cosine, 4))

# Dot product: mixes angle AND magnitude
dot = np.dot(a, b)
print("Dot product:", round(dot, 4))

# After normalization, they match
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print("Normalized dot:", round(np.dot(a_n, b_n), 4))
print("Same as cosine:", np.isclose(cosine, np.dot(a_n, b_n)))

Follow-up Questions

What about Euclidean distance?

Euclidean distance measures the straight-line distance between vector endpoints. For normalized vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so the two are monotonically related and produce identical rankings. Cosine is often preferred in practice because its bounded range of -1 to 1 (where 1 means the same direction) is more intuitive for thresholding.
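
For unit vectors the relationship is ||a - b||^2 = 2 - 2 cos(a, b), which can be checked numerically. The sketch below uses random vectors as stand-ins for embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(6, 16))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit length
query, docs = vecs[0], vecs[1:]

cos = docs @ query                              # cosine (= dot product here)
dist_sq = np.sum((docs - query) ** 2, axis=1)   # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print("identity holds:", np.allclose(dist_sq, 2 - 2 * cos))
# Highest cosine first and smallest distance first give the same order
print("same ranking:  ", np.array_equal(np.argsort(-cos), np.argsort(dist_sq)))
```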

Which metric is fastest for retrieval at scale?

Inner product (dot product) with pre-normalized vectors is typically the fastest option because it avoids the norm computation at query time. Most production vector search libraries and databases (FAISS, Milvus, Pinecone) support inner product search, commonly over HNSW or IVF indexes.

System Design

How embedding choices affect real systems — space health, retrieval architecture, dimensionality trade-offs, and evaluation strategy.

Hubness and Anisotropy

Hubness means some vectors appear as nearest neighbors for too many queries. Anisotropy means the space is unevenly distributed. Both degrade retrieval quality by biasing results toward generic items.

💡 Hubness is like one popular restaurant showing up on every "nearby food" search regardless of cuisine. Anisotropy is like all restaurants being on the same street.

Figure: hub heatmap. Each cell is a vector; brighter cells appear more often as a nearest neighbor (hub).

What Goes Wrong

Hubness occurs when certain vectors become "universal neighbors" — they appear in the top-k results for a disproportionate number of queries. This happens more often in high-dimensional spaces and is a known mathematical phenomenon, not a bug in the index.

Anisotropy means embeddings cluster into a narrow cone rather than spreading across the hypersphere. When most vectors point in roughly the same direction, cosine similarity scores become very high and very similar, making it hard to distinguish truly relevant results from noise.

Diagnosing These Issues

Hub detection: Count how often each indexed vector appears in top-k results across a sample of queries. A heavy-tailed distribution signals hubness.
Isotropy score: Measure the uniformity of embedding directions. Principal component analysis on a sample of embeddings reveals whether variance is concentrated in few dimensions.
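
One way to approximate an isotropy check is sketched below: it reports the mean pairwise cosine (high values suggest a narrow cone) and the share of variance on the top principal component (high values suggest a dominant direction). The function name and the synthetic "skewed" data are assumptions, and these two numbers are only one of several possible isotropy measures.

```python
import numpy as np

def anisotropy_report(embeddings):
    """Return (mean pairwise cosine, share of variance on the top principal component)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    mean_cos = (sims.sum() - np.trace(sims)) / (n * (n - 1))  # exclude self-pairs
    centered = embeddings - embeddings.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(mean_cos), float(variances[0] / variances.sum())

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(300, 32))
# Simulated unhealthy space: a shared offset plus one stretched axis
skewed = isotropic + 5.0
skewed[:, 0] *= 10

iso_cos, iso_share = anisotropy_report(isotropic)
skew_cos, skew_share = anisotropy_report(skewed)
print(f"isotropic: mean cosine={iso_cos:.3f}, top-PC share={iso_share:.3f}")
print(f"skewed:    mean cosine={skew_cos:.3f}, top-PC share={skew_share:.3f}")
```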

Mitigation Strategies

Strategy	How It Helps
Better fine-tuning	Hard negatives and diverse training data spread embeddings more evenly
Whitening / PCA	Decorrelates dimensions and reduces anisotropy
Reranking	A cross-encoder re-scores results and demotes hub items
Normalization	Reduces magnitude-based hubness but does not fix directional clustering

→ Not all embedding spaces are equally healthy. If you see over-retrieval of generic documents, investigate hubness and anisotropy before blaming the vector database.

Python Example

import numpy as np

# Generate 200 random embeddings in 64 dimensions
np.random.seed(42)
embeddings = np.random.randn(200, 64)

# Normalize to unit length
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Compute pairwise cosine similarity (= dot product after norm)
sims = embeddings @ embeddings.T

# Count how often each vector appears in top-5 for others
hub_counts = np.zeros(200)
for i in range(200):
    top5 = np.argsort(sims[i])[-6:-1]  # exclude self
    hub_counts[top5] += 1

print("Max hub count:", int(hub_counts.max()))
print("Mean hub count:", hub_counts.mean())

Follow-up Questions

Does hubness get worse in higher dimensions?

Yes. Hubness is a well-documented curse of dimensionality phenomenon. As dimensionality increases, the distribution of nearest-neighbor distances becomes more concentrated, making some points statistically more likely to be neighbors for many queries. Dimensionality reduction can help, but at the cost of representational capacity.

Can you fix anisotropy after training?

Partially. Post-hoc whitening (applying a learned linear transform) can redistribute embeddings more uniformly. However, the best fix is improving the training process itself with better negative sampling and contrastive objectives that encourage isotropy from the start.
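
A minimal PCA-whitening sketch is shown below, fitted on a sample of embeddings and then applied as a fixed linear transform. The function names and the synthetic data are illustrative assumptions; the same transform would need to be applied to both documents and queries.

```python
import numpy as np

def fit_whitening(embeddings):
    """Learn a mean vector and linear map that give unit covariance on this sample."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each principal direction by 1/sqrt(variance); the small constant avoids division by zero
    transform = eigvecs / np.sqrt(eigvals + 1e-8)
    return mean, transform

def apply_whitening(embeddings, mean, transform):
    return (embeddings - mean) @ transform

rng = np.random.default_rng(0)
# Simulated anisotropic embeddings: shared offset plus one dominant axis
raw = rng.normal(size=(500, 8)) + 3.0
raw[:, 0] *= 10

mean, transform = fit_whitening(raw)
white = apply_whitening(raw, mean, transform)
print("covariance close to identity:",
      np.allclose(np.cov(white, rowvar=False), np.eye(8), atol=1e-6))
```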

Dense vs Sparse Representations

Dense embeddings are continuous vectors where most dimensions carry nonzero values, capturing semantic similarity. Sparse representations are high-dimensional with few active dimensions, preserving exact term evidence. Production systems often combine both.

💡 Dense search finds things that mean the same thing. Sparse search finds things that say the same thing. Hybrid search does both.

Diagram: Dense (768-dim) vs. Sparse (30000-dim).

Dense: all dimensions active. Sparse: only a few terms have nonzero weights.

The Recall-Precision Trade-off

Dense methods capture semantic similarity better — "car" and "automobile" produce similar vectors even though they share no characters. Sparse methods like BM25 preserve exact term evidence better — they will not miss a document containing the exact product code or legal citation you searched for.
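
The "exact term evidence" property of sparse representations can be illustrated with term-weight dictionaries. The terms and weights below are invented for the example and are not real BM25 scores.

```python
# Sparse vectors stored as {term: weight}; absent terms are implicitly zero
def sparse_dot(a, b):
    """Dot product over the terms the two sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(weight * b[term] for term, weight in a.items() if term in b)

query = {"sku": 1.2, "ab-1043": 2.5}
doc_exact = {"sku": 0.8, "ab-1043": 2.1, "warranty": 0.6}
doc_paraphrase = {"product": 0.9, "identifier": 1.1, "warranty": 0.6}

print("exact-term doc:", sparse_dot(query, doc_exact))
print("paraphrase doc:", sparse_dot(query, doc_paraphrase))
```

A document that contains the exact code scores strongly, while a paraphrase with no shared terms scores zero. That is the mirror image of dense retrieval, which can match the paraphrase but may blur the code.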

Hybrid Retrieval

Because each approach has complementary strengths, hybrid retrieval has become the default in many enterprise systems. The typical pattern:

Dense search: Retrieve top-k candidates by embedding similarity.
Sparse search: Retrieve top-k candidates by BM25 keyword matching.
Merge: Combine result lists using reciprocal rank fusion (RRF) or weighted scoring.
Rerank: Use a cross-encoder (see Topic 8) to re-score the merged shortlist.
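
For the weighted-scoring variant of the merge step, one possible sketch is below. Min-max normalization and the `alpha` weight are assumptions chosen for simplicity; dense and sparse scores live on different scales, so some form of normalization is needed before blending.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """Blend normalized scores: alpha weights dense, (1 - alpha) weights sparse."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Made-up scores: cosine similarities (dense) and BM25-style scores (sparse)
dense = {"doc_3": 0.91, "doc_1": 0.88, "doc_7": 0.60}
sparse = {"doc_1": 12.4, "doc_9": 9.8, "doc_3": 3.1}
print(weighted_fusion(dense, sparse, alpha=0.5))
```

Documents found by only one retriever receive a zero from the other, so items that both retrievers agree on tend to rise to the top.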

When to Use Each

Scenario	Best Approach	Why
Natural language Q&A	Dense or hybrid	Users paraphrase; semantic matching needed
Product code lookup	Sparse or hybrid	Exact match is critical
Legal document search	Hybrid	Both terminology precision and conceptual similarity matter
Internal knowledge base	Hybrid with metadata	Documents vary in length, style, and domain

→ Dense retrieval captures meaning; sparse retrieval captures exact terms. Hybrid retrieval is the production default because neither alone is sufficient.

Python Example

# Reciprocal Rank Fusion: merge dense and sparse results
def rrf_merge(dense_ranks, sparse_ranks, k=60):
    """Merge two ranked lists using RRF scoring."""
    scores = {}
    for rank, doc_id in enumerate(dense_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    # Sort by combined score, descending
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

dense  = ["doc_3", "doc_1", "doc_7", "doc_5"]
sparse = ["doc_1", "doc_9", "doc_3", "doc_2"]
print(rrf_merge(dense, sparse))

Follow-up Questions

What is learned sparse retrieval (e.g., SPLADE)?

SPLADE and similar models learn sparse representations where the model decides which vocabulary terms to activate and with what weight. This combines the efficiency of sparse indexing with learned semantic expansion, often outperforming pure BM25 while maintaining interpretability.

How much does hybrid retrieval improve over dense-only?

Benchmarks typically show 5-15% improvement in recall@10 from adding sparse retrieval to dense. The gain is largest when the corpus contains specialized terminology, codes, or proper nouns that dense models may not encode precisely.

Bi-Encoder vs Cross-Encoder

A bi-encoder embeds query and document independently for fast, scalable retrieval. A cross-encoder processes them together for richer but slower comparison. The standard pattern: bi-encoder for recall, cross-encoder for reranking.

💡 The bi-encoder is a fast librarian who pulls a shortlist from the catalog. The cross-encoder is a careful reviewer who reads each shortlisted item alongside your question.

Bi-Encoder (Stage 1: Recall)

Query → Encoder → vec_q ↔ vec_d ← Encoder ← Doc

Cross-Encoder (Stage 2: Rerank)

Query + Doc → Joint Encoder → Score: 0.93

Speed vs Quality Trade-off

Property	Bi-Encoder	Cross-Encoder
Encoding	Independent (query + doc separate)	Joint (query + doc together)
Speed	Fast — vectors precomputed	Slow — must process each pair
Scalability	Millions of documents	Tens to hundreds of candidates
Quality	Good recall	Better precision
Use case	First-stage retrieval	Second-stage reranking

The Two-Stage Pipeline

The standard production pattern combines both:

Stage 1 (Recall): A bi-encoder retrieves the top 100-1000 candidates from the full corpus using precomputed vectors and ANN search.
Stage 2 (Precision): A cross-encoder re-scores the top 20-50 candidates by processing each query-document pair jointly, then returns the final top-k.

This gives you the speed of vector search with the quality of cross-attention, at manageable cost.

→ Use a bi-encoder for fast recall over millions of documents, then a cross-encoder to rerank the shortlist for precision. The two-stage pattern is the production standard.

Python Example

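# Two-stage retrieval: bi-encoder recall + cross-encoder rerank
# Requires the sentence-transformers package; models download on first use
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "physician salary trends"
corpus = [  # tiny illustrative corpus
    "doctor compensation trends",
    "nurse hiring pipeline",
    "river bank erosion rates",
]

# Stage 1: Bi-encoder for fast recall
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embs = bi_encoder.encode(corpus, normalize_embeddings=True)
query_emb = bi_encoder.encode(query, normalize_embeddings=True)

# Brute-force top-k by dot product; at scale an ANN index (e.g., FAISS) replaces this
top_k_indices = np.argsort(corpus_embs @ query_emb)[::-1][:2]
shortlist = [corpus[i] for i in top_k_indices]

# Stage 2: Cross-encoder for precise reranking
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in shortlist]
scores = cross_encoder.predict(pairs)

# Reorder by cross-encoder scores for final results
reranked = sorted(zip(scores, shortlist), key=lambda pair: pair[0], reverse=True)
print(reranked)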

Follow-up Questions

Can a cross-encoder replace a bi-encoder entirely?

Not at scale. A cross-encoder must process every query-document pair, which means scoring 1 million documents requires 1 million forward passes per query. Bi-encoders precompute document vectors once, reducing search to a fast nearest-neighbor lookup. The cross-encoder is reserved for a small shortlist.

What is the latency impact of adding a cross-encoder reranking stage?

Reranking 20-50 documents with a small cross-encoder typically adds 50-200ms of latency, depending on hardware and batch size. For most search applications this is acceptable because the quality improvement justifies the cost. Latency grows roughly linearly with the number of candidates and also rises with cross-encoder size.

Can you distill a cross-encoder into a bi-encoder?

Yes. Knowledge distillation trains a bi-encoder to mimic the cross-encoder's scores, producing a faster model with much of the cross-encoder's quality. This is how many high-quality bi-encoders are actually trained in practice.

How Embedding Dimension Affects System Design

Higher dimensions capture richer distinctions but increase storage, memory, and latency. Lower dimensions are cheaper and faster but may lose retrieval fidelity. The right dimension is a system decision, not just a model choice.

💡 Embedding dimension is like camera resolution. Higher resolution captures more detail, but the files are bigger, slower to transfer, and need more storage.

Example at 768 dimensions: 3.0 KB per vector, 91% Recall@10, 4.2 ms search latency.

The Full Trade-off

Dimension	Storage/Vector	Typical Recall	Best For
128-256	0.5-1 KB	80-85%	Mobile, edge, very large corpora
384-768	1.5-3 KB	88-93%	General production search
1024-1536	4-6 KB	93-96%	High-fidelity retrieval, complex domains
2048-3072	8-12 KB	95-97%	Research, maximum quality

System-Level Impact

Index footprint: 1M vectors at 768 dims (float32) = ~3 GB. At 3072 dims = ~12 GB. This determines whether the index fits in RAM.
Latency: Higher dimensions mean more computation per similarity comparison, affecting both brute-force and ANN search.
Cache efficiency: Smaller vectors fit better in CPU/GPU caches, improving throughput under high query load.
Migration cost: Changing dimensions requires re-embedding the entire corpus, which can be expensive for large collections.

Matryoshka Embeddings

Recent models like OpenAI's text-embedding-3 support Matryoshka representation learning, where you can truncate the embedding to fewer dimensions (e.g., use only the first 256 of 3072 dims) with graceful quality degradation. This lets you tune the accuracy-cost trade-off without retraining.

→ A strong engineer does not ask only "what is most accurate?" but also "what scales under real traffic?" Dimension choice is a system design decision.

Python Example

import numpy as np

# Compare storage cost at different dimensions
dims = [256, 384, 768, 1536, 3072]
n_docs = 1_000_000  # 1 million documents

for d in dims:
    # float32 = 4 bytes per dimension
    bytes_per_vec = d * 4
    total_gb = (bytes_per_vec * n_docs) / (1024**3)
    print(f"dim={d:5}  per_vec={bytes_per_vec/1024:.1f}KB  total={total_gb:.1f}GB")

# Matryoshka truncation: use first N dims
# (random data is only a stand-in; truncation keeps quality only for Matryoshka-trained models)
full_emb = np.random.randn(3072)
truncated = full_emb[:256]  # 12x smaller
truncated = truncated / np.linalg.norm(truncated)  # re-normalize

Follow-up Questions

Can quantization reduce storage without reducing dimensions?

Yes. Scalar quantization (float32 to int8) cuts storage by 4x with minimal quality loss. Product quantization (PQ) can compress further to 32-64 bytes per vector. Most production vector databases support quantized indexes for cost-efficient serving.
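
A simple form of scalar quantization can be sketched as follows. This uses one symmetric scale per vector, which is an assumption made for brevity; production systems often calibrate scales per dimension or per segment, and the small per-vector scale adds a few bytes of overhead.

```python
import numpy as np

def quantize_int8(vectors):
    """Map float32 values to int8 using one scale per vector (symmetric quantization)."""
    scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    codes = np.round(vectors / scales).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize(codes, scales):
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 768)).astype(np.float32)
codes, scales = quantize_int8(vectors)
restored = dequantize(codes, scales)

print("float32 bytes:", vectors.nbytes)
print("int8 bytes:   ", codes.nbytes)
print("max abs error:", float(np.abs(vectors - restored).max()))
```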

Is there a point of diminishing returns for dimensions?

Yes. Beyond 1024-1536 dimensions, recall improvements are typically marginal (less than 1-2%) while costs increase linearly. The optimal point depends on corpus complexity — simple FAQ search may plateau at 384 dims, while multilingual legal search benefits from 1536+.

Evaluating Embedding Models for Production

Evaluate embeddings on the task they will actually support. Offline vector similarity alone is not enough. The final question is always whether the embeddings improve business-relevant retrieval or decision quality.

💡 Evaluating an embedding model is like test-driving a car on the road you will actually use — not just checking the spec sheet.

Example metrics:
Recall@10: 0.87. How often the correct document appears in the top 10.
MRR: 0.72. Mean Reciprocal Rank: how high the first correct result ranks.
nDCG@10: 0.81. Normalized Discounted Cumulative Gain: graded relevance.
Answer Quality: 0.78. Downstream RAG accuracy using retrieved passages.

Key Metrics by Task

Task	Primary Metrics	What to Watch
Retrieval / RAG	Recall@k, MRR, nDCG	Are the right documents surfaced?
Clustering	Purity, Silhouette score	Are clusters interpretable and stable?
Recommendation	Precision@k, diversity	Are neighbors meaningfully relevant?
Classification	Accuracy, F1 with linear probe	Does the embedding separate classes?

Beyond Benchmarks

Sentence-level benchmarks (MTEB, STS) are useful for initial model selection, but the final question is always whether the embeddings improve business-relevant retrieval or decision quality in your specific pipeline. Two models with similar MTEB scores can perform very differently on your domain data.

Evaluation Checklist

Build a domain-specific test set: Use real queries and relevance judgments from your application.
Measure retrieval metrics: Recall@k and nDCG on your data, not just public benchmarks.
Test end-to-end: For RAG, measure downstream answer quality, not just retrieval quality.
Check edge cases: Test with out-of-domain queries, adversarial inputs, and multilingual content.
Monitor in production: Track retrieval quality over time as your corpus and query distribution evolve.
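
Since the checklist calls for nDCG alongside Recall@k, here is a minimal single-query nDCG@k sketch. The graded relevance labels are hypothetical, and the linear-gain form of DCG is used (some implementations use 2^rel - 1 instead).

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(retrieved_ids, relevance, k):
    """nDCG@k for one query; `relevance` maps doc_id -> graded relevance (0 if absent)."""
    gains = [relevance.get(doc_id, 0) for doc_id in retrieved_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevance = {"d1": 2, "d2": 1}  # hypothetical graded judgments

print("nDCG@5:", round(ndcg_at_k(retrieved, relevance, 5), 4))
print("perfect order:", ndcg_at_k(["d1", "d2"], relevance, 5))
```

Unlike Recall@k, nDCG rewards placing the most relevant documents near the top, so it drops when relevant results appear late in the list.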

→ Offline vector similarity alone is not enough. Test embeddings inside the full pipeline they will power, on your domain data, with business-relevant metrics.

Python Example

import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs found in top-k results."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank for one query: 1/rank of the first relevant result.
    MRR is the mean of this value over all evaluation queries."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

# Example: evaluate a retrieval result
retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant  = ["d1", "d2"]

print("Recall@3:", recall_at_k(retrieved, relevant, 3))
print("Recall@5:", recall_at_k(retrieved, relevant, 5))
print("MRR:",      mrr(retrieved, relevant))

Follow-up Questions

How do you handle the cold start problem for evaluation?

Start with a small set of hand-labeled query-relevance pairs (50-100 queries with judged results). Use LLM-as-judge for initial scalable evaluation, then refine with human review. Even a small evaluation set is far better than relying only on public benchmarks.

What is MTEB and should you trust it?

MTEB (Massive Text Embedding Benchmark) evaluates models across retrieval, clustering, classification, and STS tasks. It is useful for shortlisting candidates, but models that rank similarly on MTEB can differ significantly on your specific domain. Always validate with your own data.

How often should you re-evaluate your embedding model?

Re-evaluate whenever your corpus distribution changes significantly (new content types, languages, domains) or when query patterns shift. A quarterly review with your domain test set is a good baseline. Monitor retrieval quality metrics continuously in production.