What embeddings are, how they enable similarity-based retrieval, and the metrics that make comparison possible.
What Is an Embedding?
From Words to Vectors
An embedding maps a discrete object — a word, a sentence, an image — into a continuous vector space where geometric relationships encode semantic ones. Instead of treating text as an ID lookup, the model projects it into a space where distance and direction carry meaning.
This is what makes embeddings central to modern AI engineering. They let machines compare meaning without relying only on exact keyword matches. A query about "physician salary growth" can retrieve content about "doctor compensation trends" because the vectors are close, even though no words overlap.
How Embeddings Are Learned
Embedding vectors are not hand-crafted. They are learned during model training, often by optimizing an objective that pulls related items closer together and pushes unrelated items apart. The training objective determines what "similarity" means:
- Contrastive learning: Pairs of related texts are pulled together; random negatives are pushed apart.
- Masked language modeling: Predicting missing tokens forces the model to encode contextual meaning.
- CLIP-style training: Images and their captions are aligned in a shared space, enabling cross-modal retrieval.
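To make the contrastive objective concrete, here is a minimal NumPy sketch of an in-batch, InfoNCE-style loss. The function name `info_nce_loss`, the temperature of 0.05, and the toy vectors are illustrative assumptions, not taken from any specific model's training recipe.

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch contrastive loss: row i of `positives` is the match for row i
    of `queries`; every other row in the batch acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                      # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy, diagonal = targets

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = anchors + 0.01 * rng.normal(size=(4, 8))  # positives close to their anchors
shuffled = aligned[::-1]                            # positives paired with the wrong anchors

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print("aligned pairs: ", round(loss_aligned, 4))
print("shuffled pairs:", round(loss_shuffled, 4))
```

The loss is low when each query is most similar to its own positive and high when the pairing is wrong, which is the pressure that shapes the embedding space during training.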
Why This Matters in Interviews
A strong interview answer explains both levels: embeddings are learned numerical representations, and their value comes from turning semantic comparison into vector operations. Avoid treating all embedding models as interchangeable — mention task mismatch, domain drift, and the difference between retrieval quality and generation quality. See Topic 3: Embedding Levels for how scope affects which embedding to use.
Python Example
from math import sqrt
# A minimal cosine similarity function
# Works on any two equal-length vectors
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)
# Simulate two document embeddings (3-dimensional for clarity)
query = [0.30, 0.22, 0.91]
doc_a = [0.28, 0.20, 0.89] # semantically similar
doc_b = [0.10, 0.77, 0.14] # semantically different
print("query vs doc_a:", round(cosine(query, doc_a), 4))
print("query vs doc_b:", round(cosine(query, doc_b), 4))
How many dimensions do production embeddings typically have?
Can embeddings capture negation and other subtle meanings?
Are embeddings the same as a model's internal hidden states?
Why Embeddings Enable Semantic Search
Beyond Keyword Matching
Traditional lexical ranking functions like BM25 score documents by term frequency and exact keyword overlap. If a user searches for "physician salary growth" but the document says "doctor compensation trends," keyword search misses it entirely. Semantic search bridges this gap because both phrases produce similar embedding vectors.
This makes embeddings especially useful for question answering, support search, and long-tail user phrasing where users rarely phrase queries the same way the content was written.
The Caveat: False Proximity
Dense similarity can also retrieve conceptually adjacent but wrong content. A query about "bank interest rates" might retrieve documents about "river banks" if the embedding model is not domain-tuned. This is why production systems often add:
- Metadata filters — restrict results by category, date, or source.
- Lexical matching — combine dense retrieval with BM25 for hybrid search. See Topic 7: Dense vs Sparse.
- Reranking — use a cross-encoder to re-score the shortlist. See Topic 8: Bi-Encoder vs Cross-Encoder.
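As a rough illustration of the first safeguard, the sketch below applies a metadata filter before ranking by cosine similarity. The in-memory `index`, its `category` field, and the `filtered_search` helper are hypothetical; a real vector database would expose its own filter syntax.

```python
import numpy as np

# Hypothetical mini-index: each entry carries a vector plus metadata.
index = [
    {"id": "fin-1", "category": "finance",   "vec": np.array([0.90, 0.10, 0.30])},
    {"id": "geo-1", "category": "geography", "vec": np.array([0.85, 0.20, 0.30])},
    {"id": "fin-2", "category": "finance",   "vec": np.array([0.20, 0.90, 0.10])},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filtered_search(query_vec, category, top_k=2):
    """Apply the metadata filter first, then rank the survivors by cosine similarity."""
    candidates = [d for d in index if d["category"] == category]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]

query_vec = np.array([0.88, 0.15, 0.30])
results = filtered_search(query_vec, category="finance")
print(results)
```

Here `geo-1` is geometrically close to the query but never reaches the ranking step, because the filter removes it before any similarity is computed.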
Interview Signal
The strongest answer treats embedding search as a probabilistic relevance stage, not a guaranteed truth mechanism. Mention that recall is high but precision requires downstream filtering.
Python Example
import numpy as np
# Simulate embeddings for a query and candidate documents
query_vec = np.array([0.8, 0.1, 0.6])
docs = {
    "doctor compensation trends": np.array([0.78, 0.12, 0.58]),
    "nurse hiring pipeline": np.array([0.55, 0.30, 0.45]),
    "river bank erosion rates": np.array([0.10, 0.85, 0.05]),
}
# Score every document by cosine similarity
scores = {
    title: np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
    for title, vec in docs.items()
}
# Print results ranked from most to least similar
for title, sim in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{sim:.3f} {title}")
How do you handle queries with very different phrasing from the corpus?
What is hybrid search and when should you use it?
Does the embedding model need to be fine-tuned for every domain?
Token, Sentence, and Document Embeddings
Three Levels of Representation
| Level | Scope | Typical Use | Training Signal |
|---|---|---|---|
| Token | Single subword | Internal model processing | Language modeling objective |
| Sentence | One sentence or chunk | Semantic search, clustering | Contrastive pairs, NLI |
| Document | Full document or page | Document retrieval, classification | Pooling or chunk aggregation |
Why This Distinction Matters
Token embeddings are part of the model's internal processing and are not usually used directly for search. Each one represents a single token (in its context, for contextual encoders), not the meaning of the whole sentence. For search and clustering, you want sentence or chunk embeddings explicitly trained to preserve semantic similarity at that level.
Sentence-BERT (Reimers & Gurevych, 2019) made sentence-level similarity search dramatically more efficient than pairwise cross-encoding by producing fixed-size vectors that can be precomputed and indexed.
Choosing the Right Level
- Retrieval-augmented generation: Sentence or chunk embeddings (typically 256-512 tokens per chunk).
- Document classification: Document-level embeddings or pooled chunk embeddings.
- Named entity disambiguation: Token-level contextual embeddings from a fine-tuned encoder.
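For the document level, one common approach is to split the text into overlapping chunks, embed each chunk, and pool the chunk vectors. The sketch below assumes a stand-in `toy_embed` function in place of a real embedding model, and uses a tiny chunk size of 4 tokens purely so the output is easy to inspect; all helper names are hypothetical.

```python
import zlib
import numpy as np

def chunk_tokens(tokens, chunk_size=4, overlap=1):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    step = chunk_size - overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]

def toy_embed(chunk, dim=8):
    """Hypothetical stand-in for a real model: mean of per-token pseudo-random vectors."""
    vecs = [np.random.default_rng(zlib.crc32(tok.encode())).normal(size=dim) for tok in chunk]
    return np.mean(vecs, axis=0)

def embed_document(tokens, chunk_size=4, overlap=1):
    """Embed each chunk, L2-normalize it, then mean-pool the chunks into one unit vector."""
    chunks = chunk_tokens(tokens, chunk_size, overlap)
    chunk_embs = np.array([toy_embed(c) for c in chunks])
    chunk_embs /= np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    doc_emb = chunk_embs.mean(axis=0)
    return chunks, doc_emb / np.linalg.norm(doc_emb)

tokens = "embeddings map text into vectors so that similar meanings land close".split()
chunks, doc_emb = embed_document(tokens)
print(len(chunks), "chunks ->", doc_emb.shape)
```

Mean pooling is only one aggregation choice; keeping the individual chunk vectors and retrieving at chunk level is often preferable for retrieval-augmented generation.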
Python Example
# Sentence embedding via mean pooling (simplified)
import numpy as np
# Simulated token embeddings for "The cat sat"
token_embs = np.array([
    [0.1, 0.5, 0.3],  # "The"
    [0.8, 0.2, 0.9],  # "cat"
    [0.4, 0.6, 0.1],  # "sat"
])
# Mean pooling: average all token vectors into one
sentence_emb = token_embs.mean(axis=0)
print("Sentence embedding:", sentence_emb)
# Normalize to unit length for cosine similarity
sentence_emb = sentence_emb / np.linalg.norm(sentence_emb)
print("Normalized:", sentence_emb)
What is mean pooling vs. CLS pooling?
How do you embed documents longer than the model's context window?
Can multimodal embeddings mix text and images in the same space?
Why Engineers L2-Normalize Embeddings
What Normalization Does
L2 normalization divides each vector by its Euclidean norm (length), producing a unit vector that lies on the surface of a hypersphere. After normalization, every vector has length 1.0, so cosine_similarity(a, b) = dot_product(a, b).
Without normalization, one vector can dominate comparisons because of size rather than meaning. A document embedding with large activations would score higher in dot-product search even if a smaller-norm vector is semantically closer.
When to Normalize
- Always normalize when the embedding model documentation recommends it (most do).
- Normalize when you want cosine behavior from a vector index that computes inner product (e.g., FAISS inner-product indexes, Pinecone's dot-product metric).
- Skip normalization only when magnitude carries meaningful signal (e.g., some models encode confidence in vector length).
Practical Impact
Normalization simplifies index behavior and thresholding. With unit vectors, a dot-product score of 0.9 always corresponds to the same angle between vectors, regardless of which documents are compared. Without normalization, dot-product thresholds become unreliable because scores depend on both direction and magnitude. See Topic 5: Cosine vs Dot Product for how normalization unifies these two metrics.
Python Example
import numpy as np
# Two vectors with different magnitudes
a = np.array([3.0, 4.0])
b = np.array([0.6, 0.8])
# Before normalization: same direction, different norms
print("Norm a:", np.linalg.norm(a)) # 5.0
print("Norm b:", np.linalg.norm(b)) # 1.0
print("Dot product:", np.dot(a, b)) # 5.0 (misleading!)
# After normalization: both unit length
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print("Dot (normalized):", np.dot(a_n, b_n)) # 1.0 (identical direction)
Does normalization lose information?
Should I normalize before or after storing in the vector database?
What happens if I mix normalized and unnormalized vectors?
Cosine Similarity vs Dot Product
When Cosine Is Safer
Use cosine similarity when:
- Embedding norms vary across examples (different document lengths, mixed domains).
- The model documentation recommends cosine-based retrieval.
- You want ranking behavior that is less sensitive to scale differences from training or preprocessing.
When They Are Equivalent
If embeddings are L2-normalized (see Topic 4), cosine similarity and dot product produce identical scores, and therefore identical rankings, because every vector has unit length. In production, the real rule is consistency: use the metric your embedding model, vector index, and offline evaluation pipeline are designed around.
The Mismatch Trap
A metric mismatch can quietly change retrieval quality even though the embeddings themselves never changed. If the model was trained for cosine similarity but you search unnormalized vectors with dot product (or vice versa), ranking order can shift whenever vector norms differ. This is a common source of "nothing changed but recall dropped" bugs in production systems.
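A small sketch of the mismatch, using made-up 2-dimensional vectors: the same two documents swap places depending on whether the search uses cosine or raw dot product. The document names and values are illustrative assumptions.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = {
    "doc_small_aligned": np.array([0.9, 0.1]),  # nearly the same direction, small norm
    "doc_large_offaxis": np.array([3.0, 3.0]),  # 45 degrees away, large norm
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Top-1 result under each metric, on the same unnormalized vectors
by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
by_dot = max(docs, key=lambda name: float(np.dot(query, docs[name])))

print("Top-1 by cosine:     ", by_cosine)
print("Top-1 by dot product:", by_dot)
```

Cosine prefers the well-aligned document, while dot product rewards the large-norm one. Normalizing both documents before indexing would make the two metrics agree.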
Python Example
import numpy as np
a = np.array([0.8, 0.6])
b = np.array([0.3, 0.9])
# Cosine similarity: angle-only comparison
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Cosine:", round(cosine, 4))
# Dot product: mixes angle AND magnitude
dot = np.dot(a, b)
print("Dot product:", round(dot, 4))
# After normalization, they match
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print("Normalized dot:", round(np.dot(a_n, b_n), 4))
print("Same as cosine:", round(cosine, 4) == round(np.dot(a_n, b_n), 4))
What about Euclidean distance?
Which metric is fastest for retrieval at scale?
How embedding choices affect real systems — space health, retrieval architecture, dimensionality trade-offs, and evaluation strategy.
Hubness and Anisotropy
What Goes Wrong
Hubness occurs when certain vectors become "universal neighbors" — they appear in the top-k results for a disproportionate number of queries. This happens more often in high-dimensional spaces and is a known mathematical phenomenon, not a bug in the index.
Anisotropy means embeddings cluster into a narrow cone rather than spreading across the hypersphere. When most vectors point in roughly the same direction, cosine similarity scores become very high and very similar, making it hard to distinguish truly relevant results from noise.
Diagnosing These Issues
- Hub detection: Count how often each indexed vector appears in top-k results across a sample of queries. A heavy-tailed distribution signals hubness.
- Isotropy score: Measure the uniformity of embedding directions. Principal component analysis on a sample of embeddings reveals whether variance is concentrated in few dimensions.
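One way to sketch the isotropy check is to report two numbers for a sample of embeddings: the mean pairwise cosine (high values suggest a narrow cone) and the share of variance captured by the top principal component. The `isotropy_report` helper and the synthetic data are assumptions for illustration; real diagnostics would run on a sample of your own embeddings.

```python
import numpy as np

def isotropy_report(embeddings, top_n=1):
    """Return (mean off-diagonal cosine, variance share of the top_n principal components)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    mean_cos = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    centered = embeddings - embeddings.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2            # proportional to PCA explained variance
    top_share = variance[:top_n].sum() / variance.sum()
    return float(mean_cos), float(top_share)

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 32))
# Simulate anisotropy: stretch one dimension and add a shared offset to every vector
anisotropic = isotropic.copy()
anisotropic[:, 0] *= 10.0
anisotropic += 5.0

iso_cos, iso_share = isotropy_report(isotropic)
aniso_cos, aniso_share = isotropy_report(anisotropic)
print(f"isotropic:   mean cosine={iso_cos:.3f}  top-PC share={iso_share:.3f}")
print(f"anisotropic: mean cosine={aniso_cos:.3f}  top-PC share={aniso_share:.3f}")
```

The shared offset drives up the mean cosine, while the stretched dimension shows up as variance concentrated in one principal component; the two symptoms can occur independently.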
Mitigation Strategies
| Strategy | How It Helps |
|---|---|
| Better fine-tuning | Hard negatives and diverse training data spread embeddings more evenly |
| Whitening / PCA | Decorrelates dimensions and reduces anisotropy |
| Reranking | A cross-encoder re-scores results and demotes hub items |
| Normalization | Reduces magnitude-based hubness but does not fix directional clustering |
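The whitening row can be sketched as follows: center the embeddings, then rotate and rescale them so the covariance becomes approximately the identity. The helper names `fit_whitening` and `mean_pairwise_cosine` and the synthetic data are assumptions; this is a minimal sketch rather than a tuned post-processing pipeline.

```python
import numpy as np

def mean_pairwise_cosine(x):
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

def fit_whitening(embeddings, eps=1e-8):
    """Return (mean, transform) such that (x - mean) @ transform has ~identity covariance."""
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean
    cov = centered.T @ centered / len(embeddings)
    eigvals, eigvecs = np.linalg.eigh(cov)
    transform = eigvecs / np.sqrt(eigvals + eps)  # scale each eigenvector column to unit variance
    return mean, transform

rng = np.random.default_rng(0)
# Toy anisotropic data: a large shared offset squeezes all vectors into a narrow cone
raw = rng.normal(size=(500, 16)) + 4.0
mean, transform = fit_whitening(raw)
whitened = (raw - mean) @ transform

cos_before = mean_pairwise_cosine(raw)
cos_after = mean_pairwise_cosine(whitened)
cov_after = whitened.T @ whitened / len(whitened)
print(f"mean cosine before: {cos_before:.3f}, after: {cos_after:.3f}")
```

The same `mean` and `transform` fitted on the corpus must also be applied to every query vector at search time, otherwise queries and documents end up in different spaces.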
Python Example
import numpy as np
# Generate 200 random embeddings in 64 dimensions
np.random.seed(42)
embeddings = np.random.randn(200, 64)
# Normalize to unit length
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms
# Compute pairwise cosine similarity (= dot product after norm)
sims = embeddings @ embeddings.T
# Count how often each vector appears in top-5 for others
hub_counts = np.zeros(200)
for i in range(200):
    top5 = np.argsort(sims[i])[-6:-1]  # last index is the vector itself (similarity 1.0), so skip it
    hub_counts[top5] += 1
print("Max hub count:", int(hub_counts.max()))
print("Mean hub count:", hub_counts.mean())  # always 5.0 here; compare the max against it
Does hubness get worse in higher dimensions?
Can you fix anisotropy after training?
Dense vs Sparse Representations
The Recall-Precision Trade-off
Dense methods capture semantic similarity better: "car" and "automobile" produce similar vectors even though they are entirely different words. Sparse methods like BM25 preserve exact term evidence better: they will not miss a document containing the exact product code or legal citation you searched for.
Hybrid Retrieval
Because each approach has complementary strengths, hybrid retrieval has become the default in many enterprise systems. The typical pattern:
- Dense search: Retrieve top-k candidates by embedding similarity.
- Sparse search: Retrieve top-k candidates by BM25 keyword matching.
- Merge: Combine result lists using reciprocal rank fusion (RRF) or weighted scoring.
- Rerank: Use a cross-encoder (see Topic 8) to re-score the merged shortlist.
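For the weighted-scoring variant of the merge step, one possible sketch is shown below. It min-max normalizes each score list so that cosine scores and BM25 scores are on a comparable scale, then blends them with a weight `alpha`. The function names, the choice of min-max scaling, and the example scores are all assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """Blend normalized scores: alpha weights dense, (1 - alpha) weights sparse.
    A document missing from one list contributes 0 for that list."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    doc_ids = set(dense_n) | set(sparse_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

dense_scores = {"doc_1": 0.82, "doc_3": 0.79, "doc_7": 0.40}   # e.g. cosine similarities
sparse_scores = {"doc_1": 12.5, "doc_9": 9.1, "doc_3": 3.2}    # e.g. BM25 scores
fused = weighted_fusion(dense_scores, sparse_scores, alpha=0.5)
print(fused)
```

Unlike RRF, which only looks at ranks, weighted scoring is sensitive to the score distributions, so `alpha` usually needs tuning on a validation set.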
When to Use Each
| Scenario | Best Approach | Why |
|---|---|---|
| Natural language Q&A | Dense or hybrid | Users paraphrase; semantic matching needed |
| Product code lookup | Sparse or hybrid | Exact match is critical |
| Legal document search | Hybrid | Both terminology precision and conceptual similarity matter |
| Internal knowledge base | Hybrid with metadata | Documents vary in length, style, and domain |
Python Example
# Reciprocal Rank Fusion: merge dense and sparse results
def rrf_merge(dense_ranks, sparse_ranks, k=60):
    """Merge two ranked lists using RRF scoring."""
    scores = {}
    for rank, doc_id in enumerate(dense_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    # Sort by combined score, descending
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
dense = ["doc_3", "doc_1", "doc_7", "doc_5"]
sparse = ["doc_1", "doc_9", "doc_3", "doc_2"]
print(rrf_merge(dense, sparse))
What is learned sparse retrieval (e.g., SPLADE)?
How much does hybrid retrieval improve over dense-only?
Bi-Encoder vs Cross-Encoder
Speed vs Quality Trade-off
| Property | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Encoding | Independent (query + doc separate) | Joint (query + doc together) |
| Speed | Fast — vectors precomputed | Slow — must process each pair |
| Scalability | Millions of documents | Tens to hundreds of candidates |
| Quality | Good recall | Better precision |
| Use case | First-stage retrieval | Second-stage reranking |
The Two-Stage Pipeline
The standard production pattern combines both:
- Stage 1 (Recall): A bi-encoder retrieves the top 100-1000 candidates from the full corpus using precomputed vectors and ANN search.
- Stage 2 (Precision): A cross-encoder re-scores the top 20-50 candidates by processing each query-document pair jointly, then returns the final top-k.
This gives you the speed of vector search with the quality of cross-attention, at manageable cost.
Python Example
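# Two-stage retrieval: bi-encoder recall + cross-encoder rerank
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder
query = "physician salary trends"
# Tiny illustrative corpus; in production the embeddings are precomputed and indexed
corpus = ["doctor compensation trends", "nurse hiring pipeline", "river bank erosion rates"]
# Stage 1: Bi-encoder for fast recall
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embs = bi_encoder.encode(corpus, normalize_embeddings=True)
query_emb = bi_encoder.encode(query, normalize_embeddings=True)
# Brute-force top-k stands in here for an ANN index search (e.g. FAISS)
top_k_indices = np.argsort(corpus_embs @ query_emb)[::-1][:2]
shortlist = [corpus[i] for i in top_k_indices]
# Stage 2: Cross-encoder for precise reranking
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in shortlist]
scores = cross_encoder.predict(pairs)
# Reorder by cross-encoder scores for final results
reranked = sorted(zip(scores, shortlist), key=lambda p: p[0], reverse=True)
print(reranked)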
Can a cross-encoder replace a bi-encoder entirely?
What is the latency impact of adding a cross-encoder reranking stage?
Can you distill a cross-encoder into a bi-encoder?
How Embedding Dimension Affects System Design
The Full Trade-off
| Dimension | Storage/Vector (float32) | Typical Recall (illustrative) | Best For |
|---|---|---|---|
| 128-256 | 0.5-1 KB | 80-85% | Mobile, edge, very large corpora |
| 384-768 | 1.5-3 KB | 88-93% | General production search |
| 1024-1536 | 4-6 KB | 93-96% | High-fidelity retrieval, complex domains |
| 2048-3072 | 8-12 KB | 95-97% | Research, maximum quality |
System-Level Impact
- Index footprint: 1M vectors at 768 dims (float32) = ~3 GB. At 3072 dims = ~12 GB. This determines whether the index fits in RAM.
- Latency: Higher dimensions mean more computation per similarity comparison, affecting both brute-force and ANN search.
- Cache efficiency: Smaller vectors fit better in CPU/GPU caches, improving throughput under high query load.
- Migration cost: Changing dimensions requires re-embedding the entire corpus, which can be expensive for large collections.
Matryoshka Embeddings
Recent models such as OpenAI's text-embedding-3 family support Matryoshka representation learning, where you can truncate the embedding to fewer dimensions (e.g., keep only the first 256 of 3072 dims, then re-normalize) with graceful quality degradation. This lets you tune the accuracy-cost trade-off without retraining.
Python Example
import numpy as np
# Compare storage cost at different dimensions
dims = [256, 384, 768, 1536, 3072]
n_docs = 1_000_000 # 1 million documents
for d in dims:
    # float32 = 4 bytes per dimension
    bytes_per_vec = d * 4
    total_gb = (bytes_per_vec * n_docs) / (1024**3)
    print(f"dim={d:5} per_vec={bytes_per_vec/1024:.1f}KB total={total_gb:.1f}GB")
# Matryoshka truncation: use first N dims
full_emb = np.random.randn(3072)
truncated = full_emb[:256] # 12x smaller, still useful
truncated = truncated / np.linalg.norm(truncated) # re-normalize
Can quantization reduce storage without reducing dimensions?
Is there a point of diminishing returns for dimensions?
Evaluating Embedding Models for Production
Key Metrics by Task
| Task | Primary Metrics | What to Watch |
|---|---|---|
| Retrieval / RAG | Recall@k, MRR, nDCG | Are the right documents surfaced? |
| Clustering | Purity, Silhouette score | Are clusters interpretable and stable? |
| Recommendation | Precision@k, diversity | Are neighbors meaningfully relevant? |
| Classification | Accuracy, F1 with linear probe | Does the embedding separate classes? |
Beyond Benchmarks
Sentence-level benchmarks (MTEB, STS) are useful for initial model selection, but the final question is always whether the embeddings improve business-relevant retrieval or decision quality in your specific pipeline. Two models with similar MTEB scores can perform very differently on your domain data.
Evaluation Checklist
- Build a domain-specific test set: Use real queries and relevance judgments from your application.
- Measure retrieval metrics: Recall@k and nDCG on your data, not just public benchmarks.
- Test end-to-end: For RAG, measure downstream answer quality, not just retrieval quality.
- Check edge cases: Test with out-of-domain queries, adversarial inputs, and multilingual content.
- Monitor in production: Track retrieval quality over time as your corpus and query distribution evolve.
Python Example
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs found in top-k results."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no relevant docs: report 0 instead of dividing by zero
    return len(top_k & relevant) / len(relevant)
def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank for one query: 1/rank of the first relevant result.
    Average this value over all queries to get Mean Reciprocal Rank."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
# Example: evaluate a retrieval result
retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant = ["d1", "d2"]
print("Recall@3:", recall_at_k(retrieved, relevant, 3))
print("Recall@5:", recall_at_k(retrieved, relevant, 5))
print("MRR:", mrr(retrieved, relevant))
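The metrics table also lists nDCG, which rewards placing relevant documents near the top of the ranking. A minimal sketch, assuming graded relevance is supplied as a hypothetical `{doc_id: gain}` dictionary and using the common log2 position discount:

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(retrieved_ids, relevance, k):
    """nDCG@k for one query. `relevance` maps doc_id -> gain; missing ids count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in retrieved_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevance = {"d1": 1, "d2": 1}

score = ndcg_at_k(retrieved, relevance, 5)
perfect = ndcg_at_k(["d1", "d2", "d3"], relevance, 5)
print("nDCG@5:", round(score, 4))
print("nDCG@5 (ideal order):", round(perfect, 4))
```

In this example Recall@5 is perfect because both relevant documents are found, yet nDCG@5 stays well below 1.0 because they appear at ranks 3 and 5 rather than 1 and 2.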