What retrieval-augmented generation is, how different retrieval methods work, and the design decisions that shape what evidence reaches the model.
What Is RAG?
Without RAG:
Q: What is our refund policy?
A: "I believe the refund policy is 30 days..." (hallucinated; the actual policy is 14 business days)
With RAG:
Q: What is our refund policy?
Retrieved: "Refunds must be requested within 14 business days..."
A: "According to our policy, refunds must be requested within 14 business days."
Why RAG Exists
Not all knowledge should live inside model weights. External retrieval gives the system access to fresher and more auditable evidence. RAG improves controllability as much as it improves accuracy (Lewis et al., 2020):
- Freshness: Update the knowledge base without retraining the model
- Auditability: Every answer can cite its source document
- Domain specificity: Inject proprietary or regulated content the model was never trained on
- Cost: Cheaper than fine-tuning for knowledge-heavy use cases
The RAG Pipeline
A standard RAG pipeline has five stages: the user query arrives, it may be rewritten for better retrieval (see Topic 9: Query Rewriting), relevant passages are retrieved from the knowledge base, the passages are reranked (see Topic 8: Reranking), and finally the top passages are assembled into a prompt for the generator.
RAG Is a System, Not a Feature
In interviews, emphasize that RAG is a system design pattern, not a single API call. Quality depends on every component: chunking (Topic 4), retrieval method (Topic 2/Topic 3), metadata filtering (Topic 5), reranking (Topic 8), and evaluation (Topic 10). The same model can look excellent or terrible depending on how the knowledge base is chunked and ranked.
Python Example
# Minimal RAG pipeline skeleton
import openai  # `client` below is an OpenAI client, e.g. openai.OpenAI()

SYSTEM_PROMPT = (
    "Answer based ONLY on the provided context. If the context does not "
    "contain the answer, say so. Cite [Source N] for each claim."
)

def simple_rag(query, retriever, client, top_k=5):
    """Retrieve, then generate a grounded answer."""
    # Step 1: Retrieve relevant passages
    passages = retriever.search(query, top_k=top_k)

    # Step 2: Assemble context for the generator
    context = "\n\n".join(
        f"[Source {i+1}]: {p.text}"
        for i, p in enumerate(passages)
    )

    # Step 3: Generate with retrieved context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
When should you use RAG vs fine-tuning?
What are the main failure modes of RAG?
Does RAG eliminate hallucination?
Lexical vs Dense Retrieval
Lexical Retrieval (BM25, TF-IDF)
Lexical methods match documents based on term overlap. BM25 is the standard: it rewards documents containing query terms, adjusted for term frequency and document length. Lexical retrieval excels at exact matches — product names, error codes, legal clause numbers, acronyms.
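To make the scoring rule concrete, here is a minimal sketch of Okapi BM25 in plain Python. It is an illustration rather than a production implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (illustrative text, already lowercased and tokenized)
docs = [
    "error e-4021 deployment pipeline timeout".split(),
    "resolving ci/cd failures in production".split(),
    "configure deployment retry settings".split(),
]
scores = bm25_scores("e-4021 timeout".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the first document shares any term with the query, so it is the only one with a non-zero score. This is the exact-match behavior the paragraph above describes.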
Dense Retrieval (Bi-encoder)
Dense retrieval encodes both query and document into embedding vectors, then retrieves by vector similarity (typically cosine or dot product). It captures semantic meaning, so "how to fix deployment failures" matches "resolving CI/CD pipeline errors" even without shared keywords.
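The difference between cosine similarity and the raw dot product can be sketched with a few hand-made vectors. The three-dimensional "embeddings" below are made-up numbers, not model output, and the helper names are chosen for this example only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, cosine) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

# Documents 0 and 2 point in the same direction; document 2 is just longer
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [2.7, 0.3, 0.0]]
query_vec = [1.0, 0.0, 0.0]
hits = top_k(query_vec, doc_vecs)
print(hits)
```

Cosine treats documents 0 and 2 as equally similar, while the raw dot product would favor the longer vector. Once vectors are unit-normalized the two measures coincide, which is why many pipelines normalize embeddings at index time.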
Trade-offs
| Dimension | Lexical | Dense |
|---|---|---|
| Exact matches | Excellent | Often misses rare identifiers |
| Semantic recall | Poor (requires word overlap) | Strong (captures meaning) |
| Speed | Very fast (inverted index) | Fast with ANN index |
| Index size | Moderate | Larger (stores vectors) |
| Zero-shot domains | Works immediately | Needs good embedding model |
Enterprise Reality
In enterprise systems, you often need both because users ask conceptually while documents are written operationally. This is why Topic 3: Hybrid Retrieval has become the production standard.
Python Example
import re

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Error E-4021: deployment pipeline timeout after 300s",
    "Resolving CI/CD failures in production environments",
    "How to configure deployment retry settings",
]

def tokenize(text):
    # Lowercase and keep letters, digits, and hyphens so "E-4021:" and
    # "E-4021" yield the same token (a plain split() would keep the colon)
    return re.findall(r"[a-z0-9-]+", text.lower())

# --- Lexical retrieval with BM25 ---
bm25 = BM25Okapi([tokenize(d) for d in docs])
query = "fix error E-4021 deploying"
lex_scores = bm25.get_scores(tokenize(query))
print("BM25 scores:", lex_scores)  # highest for the doc containing "E-4021"

# --- Dense retrieval with embeddings ---
model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalize so the dot product below equals cosine similarity
doc_embs = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)
dense_scores = np.dot(doc_embs, q_emb.T).flatten()
print("Dense scores:", dense_scores)  # higher for semantically similar docs
Is BM25 still relevant in the age of embeddings?
What is a bi-encoder vs a cross-encoder?
How do you handle multilingual retrieval?
Hybrid Retrieval
Why Hybrid Wins
Real enterprise queries contain a mix of exact identifiers and conceptual language. "How do I resolve E-4021 timeout during blue-green deployment?" has an error code (lexical) and a conceptual description (dense). Hybrid search reduces the blind spots of each method.
Fusion Strategies
| Strategy | How It Works | Pros/Cons |
|---|---|---|
| Reciprocal Rank Fusion (RRF) | Merges result lists by reciprocal rank position | Simple, no tuning needed; ignores score magnitude |
| Score normalization + weighting | Normalize BM25 and dense scores to [0,1], then combine with weights | Tunable; requires calibration |
| Learned fusion | Train a small model to combine signals | Best quality; needs training data |
Practical Guidance
Start with RRF: it requires no score calibration and works surprisingly well. Move to weighted score fusion, and tune the BM25/dense weight ratio, only after you have evaluation data (see Topic 10: Retrieval Metrics). A 40/60 or 50/50 BM25/dense split is a common starting point, not a universal optimum.
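For comparison with RRF, the "score normalization + weighting" row of the table can be sketched as follows. This is a minimal illustration: min-max scaling is only one possible normalization, and the function names and the `bm25_weight` parameter are assumptions made for this example.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: avoid dividing by zero
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(lexical_scores, dense_scores, bm25_weight=0.5):
    """Combine normalized BM25 and dense scores with a single weight."""
    lex_n, dense_n = min_max(lexical_scores), min_max(dense_scores)
    fused = {}
    for doc_id in set(lex_n) | set(dense_n):
        # A document missing from one list contributes 0 for that signal
        fused[doc_id] = (bm25_weight * lex_n.get(doc_id, 0.0)
                         + (1 - bm25_weight) * dense_n.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

bm25 = {"doc_1": 8.2, "doc_3": 6.1, "doc_5": 4.0}
dense = {"doc_3": 0.92, "doc_2": 0.88, "doc_1": 0.85}
fused = weighted_fusion(bm25, dense, bm25_weight=0.4)
print(fused)
```

Unlike RRF, this approach is sensitive to score magnitudes and outliers, which is why it usually needs calibration against evaluation data before the weight means anything.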
Python Example
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge multiple ranked lists using RRF.

    Args:
        result_lists: list of lists of (doc_id, score) tuples,
            each sorted best-first
        k: smoothing constant (default 60)
    Returns:
        sorted list of (doc_id, rrf_score) tuples
    """
    scores = {}
    for results in result_lists:
        # Ranks are 1-based; the raw scores are deliberately ignored
        for rank, (doc_id, _) in enumerate(results, start=1):
            # RRF formula: 1 / (k + rank)
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Usage: fuse BM25 and dense retrieval results
bm25_results = [("doc_1", 8.2), ("doc_3", 6.1), ("doc_5", 4.0)]
dense_results = [("doc_3", 0.92), ("doc_2", 0.88), ("doc_1", 0.85)]
fused = reciprocal_rank_fusion([bm25_results, dense_results])
print("Fused ranking:", fused[:5])
How do you choose the weight between lexical and dense?
Does hybrid retrieval double the infrastructure cost?
Can you use more than two retrieval signals?
Chunking Strategies
Why Chunking Dominates Quality
Chunking shapes what the retriever can find and what the generator can understand. A poorly chunked knowledge base will defeat even the best embedding model and reranker. In interviews, say that chunking is the single highest-leverage design decision in most RAG systems.
Chunking Approaches
| Method | How It Works | Best For |
|---|---|---|
| Fixed token window | Split every N tokens with M overlap | Unstructured text, logs |
| Sentence boundary | Split at sentence or paragraph breaks | Articles, documentation |
| Structural (headings) | Split at document headings/sections | Technical docs, wikis |
| Semantic | Split when embedding similarity drops | Conversations, transcripts |
| Parent-child | Index small chunks, retrieve parent sections | Long documents needing context |
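The parent-child row is the least self-explanatory entry in the table, so here is a rough sketch of the idea. The word-overlap scorer is a stand-in for BM25 or embedding similarity, the sample section text is invented for illustration, and all function names are hypothetical.

```python
def build_parent_child_index(sections, child_size=50):
    """Split each parent section into small child chunks.

    Returns (parents, children); every child records its parent's id.
    """
    parents, children = {}, []
    for parent_id, text in enumerate(sections):
        parents[parent_id] = text
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return parents, children

def retrieve_parents(query, parents, children, top_k=1):
    """Match on child chunks, then return de-duplicated parent sections."""
    query_terms = set(query.lower().split())
    # Stand-in relevance score: number of shared words
    ranked = sorted(
        children,
        key=lambda c: len(query_terms & set(c["text"].lower().split())),
        reverse=True,
    )
    seen, results = set(), []
    for child in ranked:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            results.append(parents[child["parent_id"]])
        if len(results) == top_k:
            break
    return results

sections = [
    "Refund policy. Refunds must be requested within 14 business days. "
    "Approved refunds are returned to the original payment method.",
    "Shipping policy. Orders ship within two business days. "
    "Shipping is free for orders over the stated threshold.",
]
parents, children = build_parent_child_index(sections, child_size=8)
results = retrieve_parents("original payment method", parents, children)
print(results)
```

The query matches a small child chunk precisely, but the generator receives the whole parent section, so it also sees the surrounding policy text.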
Overlap Matters
Overlap between chunks ensures that information at chunk boundaries is not lost. A typical setting is 10-15% overlap (e.g., 60-token overlap on 400-token chunks). Without overlap, a sentence split across two chunks may not be retrievable by either.
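As a worked example of the overlap arithmetic (the symbols N, C, and O are introduced here for illustration): with chunk size C and overlap O, each new chunk advances by a stride of C − O tokens, so a document of N > C tokens yields

```latex
\[
  n_{\text{chunks}} = \left\lceil \frac{N - O}{C - O} \right\rceil,
  \qquad
  \text{e.g. } \left\lceil \frac{1200 - 60}{400 - 60} \right\rceil
  = \lceil 3.35 \rceil = 4 .
\]
```

At these settings the index stores roughly C / (C − O) ≈ 1.18 times as many tokens as the source document. That extra storage is the cost of the overlap.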
Practical Guidelines
- 200-500 tokens per chunk is a common starting range for general documents
- Respect document structure: do not split in the middle of tables, code blocks, or list items
- Include metadata: attach the document title, section heading, and source URL to each chunk
- Test empirically: the best chunk size depends on your queries and documents — there is no universal answer
Python Example
def chunk_text(tokens, chunk_size=400, overlap=60):
    """Split a token list into overlapping chunks.

    Overlap ensures information at chunk boundaries
    is captured by at least one chunk, improving recall.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        # Step forward by (chunk_size - overlap)
        start = end - overlap
    return chunks

# Example: chunk a document of 1200 tokens
tokens = list(range(1200))  # simulated token IDs
chunks = chunk_text(tokens, chunk_size=400, overlap=60)
print(f"Created {len(chunks)} chunks")
for i, c in enumerate(chunks):
    print(f"  Chunk {i}: tokens {c[0]}-{c[-1]} ({len(c)} tokens)")
# Output: 4 chunks (0-399, 340-739, 680-1079, 1020-1199),
# each sharing 60 tokens with its neighbor
What is parent-child chunking?
How do you chunk tables and code blocks?
Should chunk size match the embedding model's max input?
Metadata Filters
Without filters:
Query: "What is the return policy?"
Result 1: Return policy for Product X (Region: EU) [WRONG region]
Result 2: Return policy v2021 (outdated) [WRONG version]
Result 3: Return policy for Product Y (Region: US, current) [CORRECT but ranked 3rd]
With filters:
Query: "What is the return policy?"
Filters: region=US, product=Y, status=current
Result 1: Return policy for Product Y (Region: US, current) [CORRECT and ranked 1st]
Why Filters Beat Better Embeddings
Retrieval quality is not only about better embeddings. Structured constraints can do a large amount of work cheaply and reliably. A filter that restricts search to the correct tenant, date range, or product eliminates entire categories of irrelevant results before semantic scoring even begins.
Common Filter Dimensions
| Filter | What It Constrains | Example |
|---|---|---|
| Tenant / Organization | Multi-tenant isolation | Only search Company A's docs |
| Date range | Temporal scope | Only docs updated in last 90 days |
| Product / Category | Domain scope | Only docs about "Enterprise Plan" |
| Language | Linguistic scope | Only English-language documents |
| Permission / ACL | Access control | Only docs the user can see |
| Document type | Format filtering | Only FAQ pages, not blog posts |
Implementation
Most vector databases support pre-filtering (apply the filter before or during vector search) and post-filtering (apply the filter after vector search). Pre-filtering is generally preferred because it reduces the search space and returns the requested number of results whenever enough documents match the filter. Post-filtering can return fewer results than requested, or none at all, if many of the top hits are filtered out.
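The difference between the two modes is easiest to see on a tiny in-memory example. The sketch below uses brute-force search over made-up two-dimensional vectors; the document records and function names are hypothetical and do not correspond to any particular database's API.

```python
def similarity(a, b):
    return sum(x * y for x, y in zip(a, b))

def post_filter_search(query_vec, docs, predicate, k):
    """Take the top-k by similarity first, then drop non-matching docs."""
    top = sorted(docs, key=lambda d: similarity(query_vec, d["vec"]),
                 reverse=True)[:k]
    return [d for d in top if predicate(d)]

def pre_filter_search(query_vec, docs, predicate, k):
    """Restrict to matching docs first, then take the top-k by similarity."""
    allowed = [d for d in docs if predicate(d)]
    return sorted(allowed, key=lambda d: similarity(query_vec, d["vec"]),
                  reverse=True)[:k]

def is_us(doc):
    return doc["region"] == "US"

docs = [
    {"id": "eu-1", "region": "EU", "vec": [0.9, 0.1]},
    {"id": "eu-2", "region": "EU", "vec": [0.8, 0.2]},
    {"id": "us-1", "region": "US", "vec": [0.7, 0.3]},
    {"id": "us-2", "region": "US", "vec": [0.1, 0.9]},
]
query_vec = [1.0, 0.0]

post = [d["id"] for d in post_filter_search(query_vec, docs, is_us, k=2)]
pre = [d["id"] for d in pre_filter_search(query_vec, docs, is_us, k=2)]
print("post-filter:", post)  # → []
print("pre-filter: ", pre)   # → ['us-1', 'us-2']
```

Post-filtering returns nothing here because both of the two most similar documents belong to the wrong region. Pre-filtering still fills both slots.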
Python Example
# Metadata-filtered search with a vector database (Qdrant example)
from datetime import datetime, timedelta, timezone

from qdrant_client import QdrantClient
from qdrant_client.models import DatetimeRange, FieldCondition, Filter, MatchValue

client = QdrantClient("localhost", port=6333)

# `query_embedding` is assumed to be the query vector produced by
# the same embedding model that indexed the collection
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

# Search with metadata filters: right tenant, recent docs only
# (newer qdrant-client releases favor client.query_points over search)
results = client.search(
    collection_name="knowledge_base",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            # Only this tenant's documents
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value="acme-corp"),
            ),
            # Only documents updated in the last 90 days
            # (Range is numeric-only; datetimes need DatetimeRange)
            FieldCondition(
                key="updated_at",
                range=DatetimeRange(gte=cutoff),
            ),
            # Only English-language content
            FieldCondition(
                key="language",
                match=MatchValue(value="en"),
            ),
        ]
    ),
    limit=10,
)
# The filter is applied during the vector search, so every
# returned hit (up to 10) satisfies the constraints
Should you pre-filter or post-filter?
How do you handle permission-based filtering in RAG?
Can metadata filters replace better embeddings?
The infrastructure that makes retrieval fast and scalable, plus the techniques that improve ranking quality and the metrics that measure it.
Vector Databases
What Vector Databases Provide
- Efficient similarity search: Find the most similar vectors to a query vector at scale using ANN indexes (see Topic 7: Approximate Nearest Neighbor)
- Metadata filtering: Combine vector search with structured filters (see Topic 5: Metadata Filters)
- CRUD operations: Add, update, delete vectors without rebuilding the entire index
- Operational features: Replication, backups, access control, monitoring
Popular Options
| Database | Type | Key Strength |
|---|---|---|
| Pinecone | Managed SaaS | Zero-ops, automatic scaling |
| Weaviate | Open source | Built-in hybrid search, multi-modal |
| Qdrant | Open source | Filtering performance, Rust-based speed |
| Milvus | Open source | Scale to billions of vectors |
| pgvector | PostgreSQL extension | Use existing Postgres infrastructure |
| Chroma | Open source | Developer-friendly, great for prototyping |
The Important Interview Point
The vector store is infrastructure, not intelligence. It makes retrieval feasible and fast, but relevance still depends on the embedding model, chunking strategy, and ranking logic layered above it. Saying "we used Pinecone" does not explain why your RAG system produces good answers any more than saying "we used PostgreSQL" explains why your web app has good UX.
Python Example
# Example: indexing and searching with Chroma (lightweight)
import chromadb

# Create a local, in-memory collection
client = chromadb.Client()
collection = client.create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"},  # cosine similarity
)

# Add documents with metadata; no embeddings are passed, so Chroma
# computes them with the collection's default embedding function
collection.add(
    documents=[
        "Refunds must be requested within 14 days.",
        "Shipping is free for orders over $50.",
        "Contact support at support@example.com.",
    ],
    metadatas=[
        {"category": "refunds", "region": "US"},
        {"category": "shipping", "region": "US"},
        {"category": "support", "region": "global"},
    ],
    ids=["doc1", "doc2", "doc3"],
)

# Search with optional metadata filter
results = collection.query(
    query_texts=["How do I get a refund?"],
    n_results=2,
    where={"region": "US"},  # metadata filter
)
print(results["documents"])
When should you use pgvector instead of a dedicated vector database?
How do you handle embedding model upgrades?
What about just using FAISS without a database?
Approximate Nearest Neighbor
Why ANN Exists
Exact nearest-neighbor search compares the query against every vector in the index. For a million vectors, that is a million dot products per query. For a hundred million vectors, it becomes impractical for real-time serving. ANN indexes structure the search space so that most vectors can be skipped.
Popular ANN Algorithms
| Algorithm | Approach | Trade-off |
|---|---|---|
| HNSW | Hierarchical navigable small world graph | High recall, higher memory |
| IVF | Inverted file with cluster-based partitioning | Lower memory, tunable recall/speed |
| ScaNN | Quantization + anisotropic scoring | Very fast, Google-optimized |
| Product Quantization | Compress vectors, search in compressed space | Smallest memory, some accuracy loss |
Tuning the Trade-off
Every ANN index has knobs that control the recall-vs-speed trade-off. For HNSW, the key parameters are ef_construction (build quality) and ef_search (query quality). Higher values improve recall but increase latency. The practical approach is to measure recall@k on a held-out set and tune until you hit your target (typically 95-99% recall).
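The "measure recall@k, then tune" loop can be sketched without any ANN library. The toy index below follows the IVF idea from the table, where the equivalent knob is usually called `nprobe` (how many clusters to scan). It is deliberately simplified: centroids are sampled at random instead of being trained with k-means, and the data is random, so the printed numbers are only illustrative.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(query, vectors, k):
    """Brute force: compare the query against every vector."""
    return sorted(range(len(vectors)),
                  key=lambda i: sq_dist(query, vectors[i]))[:k]

def build_ivf(vectors, n_lists, rng):
    """Toy IVF: sample centroids, assign each vector to its nearest one."""
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for i, v in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
centroids, lists = build_ivf(vectors, n_lists=16, rng=rng)

def recall_at_k(nprobe, k=10):
    """Share of the exact top-k that the approximate search also returns."""
    hits = 0
    for q in queries:
        truth = set(exact_search(q, vectors, k))
        found = set(ivf_search(q, vectors, centroids, lists, k, nprobe))
        hits += len(truth & found)
    return hits / (k * len(queries))

for nprobe in (1, 4, 16):
    print(f"nprobe={nprobe:2d}  recall@10={recall_at_k(nprobe):.2f}")
```

Scanning more lists can only add candidates, so recall never decreases as `nprobe` grows, and it reaches 100% when every list is scanned. The cost is that the search degenerates back toward brute force.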
Python Example
import time

import faiss
import numpy as np

# Generate 100K random vectors (simulating embeddings)
d = 384        # embedding dimension
n = 100_000    # number of documents
vectors = np.random.randn(n, d).astype("float32")
queries = np.random.randn(5, d).astype("float32")
# Unit-normalize so inner product equals cosine similarity
faiss.normalize_L2(vectors)
faiss.normalize_L2(queries)

# --- Exact search (brute force) ---
exact_index = faiss.IndexFlatIP(d)  # inner product
exact_index.add(vectors)

# --- ANN search (HNSW) ---
# Use the same metric as the exact index (IndexHNSWFlat defaults to L2)
hnsw_index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # 32 neighbors per node
hnsw_index.hnsw.efConstruction = 200  # build quality
hnsw_index.hnsw.efSearch = 64         # query quality (tune this)
hnsw_index.add(vectors)

# Compare: query with 5 random vectors
t0 = time.perf_counter()
D_exact, I_exact = exact_index.search(queries, 10)
print(f"Exact: {(time.perf_counter() - t0) * 1000:.1f}ms")
t0 = time.perf_counter()
D_ann, I_ann = hnsw_index.search(queries, 10)
print(f"HNSW: {(time.perf_counter() - t0) * 1000:.1f}ms")

# Measure recall: how many ANN results match exact results?
recall = np.mean([
    len(set(I_exact[i]) & set(I_ann[i])) / 10.0
    for i in range(5)
])
print(f"Recall@10: {recall:.1%}")
How do you choose between HNSW and IVF?
Does ANN recall loss actually affect RAG quality?
What about GPU-accelerated vector search?
Reranking
Two-Stage Architecture
The standard pattern is:
- Stage 1 (Retriever): Bi-encoder retrieves top-100 candidates quickly using precomputed embeddings
- Stage 2 (Reranker): Cross-encoder scores each candidate against the query with full attention, then returns the top-5 or top-10
This gives you the scalability of vector search and the precision of richer query-document interaction. The cross-encoder sees the query and document together, enabling deeper relevance assessment than independent embeddings can provide.
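The control flow of the two stages can be sketched independently of any model. In the example below the two scoring functions are toy stand-ins (word overlap for the bi-encoder, an exact-phrase bonus for the cross-encoder), and the function and parameter names are assumptions made for illustration.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score,
                         n_candidates=100, top_k=5):
    """Stage 1 narrows the corpus cheaply; stage 2 rescores the survivors."""
    # Stage 1: fast, approximate scoring over the whole corpus
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: slower, more precise scoring over the candidates only
    return sorted(candidates, key=lambda d: precise_score(query, d),
                  reverse=True)[:top_k]

def word_overlap(query, doc):
    """Toy first-stage score: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    """Toy second-stage score: rewards the exact query phrase."""
    bonus = 10 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

corpus = [
    "policy on refund refund refund abuse",
    "our refund policy allows returns within 14 days",
    "contact support for order tracking",
]
top = retrieve_then_rerank("refund policy", corpus, word_overlap, phrase_match,
                           n_candidates=2, top_k=1)
print(top)
```

The first stage cannot tell the two refund documents apart, but it does discard the irrelevant one. The second stage only has to score two candidates to pick the passage that actually states the policy.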
Why Reranking Works
| Property | Bi-Encoder (Retriever) | Cross-Encoder (Reranker) |
|---|---|---|
| Input | Query and doc encoded separately | Query + doc as one input pair |
| Interaction | Dot product of independent vectors | Full attention between query and doc tokens |
| Speed | ~1ms per query (precomputed) | ~50ms per query-doc pair |
| Quality | Good recall, approximate relevance | Higher precision, fine-grained relevance |
| Scalability | Millions of documents | 10-100 candidates per query |
In Practice
In interviews, explain reranking as a second-stage quality filter. It is one of the highest-impact improvements you can add to a RAG system, often improving answer quality by 10-20% without changing the index or embedding model. Popular rerankers include Cohere Rerank, BGE-reranker, and cross-encoder models from Hugging Face.
Python Example
from sentence_transformers import CrossEncoder

# Load a cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is the refund policy for international orders?"

# Candidates from first-stage retrieval (bi-encoder)
candidates = [
    "Shipping costs are non-refundable for all orders.",
    "International refunds take 10-15 business days.",
    "Our refund policy allows returns within 14 days.",
    "Contact support for order tracking information.",
    "International orders may incur customs duties.",
]

# Score each candidate against the query
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Rerank by cross-encoder score
ranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True,
)
for doc, score in ranked:
    print(f"  {score:>6.3f}  {doc}")
# The candidates about international refunds and the refund policy
# should score highest; exact scores depend on the model version
How many candidates should you rerank?
Can you use an LLM as a reranker?
Does reranking help when the first-stage retrieval is already good?
Query Rewriting
Q2: "Common SSO error codes and fixes"
Why Users Write Bad Queries
Users do not naturally speak in index-friendly language. They use colloquialisms ("the SSO thing"), abbreviations, vague references ("it broke"), and compound questions that combine multiple intents. The retriever must bridge this gap.
Rewriting Strategies
| Strategy | What It Does | When to Use |
|---|---|---|
| Acronym expansion | SSO → Single Sign-On | Technical domains with heavy jargon |
| Keyword injection | Add related terms for BM25 | Hybrid retrieval systems |
| Query decomposition | Split complex query into sub-queries | Multi-part questions |
| Hypothetical answer | Generate what a good answer looks like, use it as query (HyDE) | Abstract or conceptual queries |
| Conversation context | Inject context from conversation history | Multi-turn chat RAG |
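Two of the strategies in the table need no LLM at all. The rule-based sketch below shows acronym expansion and a naive query decomposition. The glossary entries, the splitting rule, and the function names are assumptions for illustration; a real system would load the glossary from domain documentation and would likely use an LLM for decomposition.

```python
import re

# Hypothetical glossary for illustration only
ACRONYMS = {"sso": "Single Sign-On", "mfa": "multi-factor authentication"}

def expand_acronyms(query, glossary=ACRONYMS):
    """Append the expansion after each known acronym, keeping the original."""
    def replace(match):
        word = match.group(0)
        expansion = glossary.get(word.lower())
        return f"{word} ({expansion})" if expansion else word
    return re.sub(r"[A-Za-z]+", replace, query)

def decompose(query):
    """Naive decomposition: split a compound question on ' and ' or ';'."""
    parts = re.split(r"\s+and\s+|;", query)
    return [p.strip(" ?") for p in parts if p.strip(" ?")]

expanded = expand_acronyms("fix the SSO thing on mobile")
sub_queries = decompose("How do I reset MFA and where are SSO logs stored?")
print(expanded)     # → fix the SSO (Single Sign-On) thing on mobile
print(sub_queries)  # → ['How do I reset MFA', 'where are SSO logs stored']
```

Keeping the original acronym next to its expansion lets lexical retrieval match documents that use either form.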
Cost vs Impact
Query rewriting is often one of the cheapest ways to improve recall without re-embedding the corpus. A single LLM call to rewrite the query is inexpensive (typically well under a cent with a small model) and can substantially improve first-stage retrieval. In interviews, mention query rewriting as a high-leverage, low-cost intervention.
Python Example
import openai  # `client` below is an OpenAI client, e.g. openai.OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's question to improve document retrieval. "
    "Expand acronyms, add relevant keywords, and resolve ambiguous "
    "references using conversation context if provided. Output ONLY "
    "the rewritten query, nothing else."
)

def rewrite_query(user_query, client, chat_history=None):
    """Rewrite a user query for better retrieval."""
    context = ""
    if chat_history:
        # Keep only the last three turns to bound prompt size
        context = "\n".join(
            f"{m['role']}: {m['content']}"
            for m in chat_history[-3:]
        )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap and fast for rewrites
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nOriginal: {user_query}"},
        ],
        temperature=0.0,
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()

# Example: "fix the SSO thing on mobile"
# -> "Troubleshoot Single Sign-On authentication failures on mobile"
What is HyDE (Hypothetical Document Embeddings)?
Does query rewriting add too much latency?
How do you handle multi-turn conversations in RAG?
Retrieval Metrics
Retrieval Scorecard
| Metric | What It Checks | Why It Matters |
|---|---|---|
| Recall@k | Relevant evidence appears in the candidate set | Low recall means the generator never sees the right facts |
| Precision@k | Returned context is mostly useful | High noise wastes context window and increases hallucination risk |
| MRR | Position of the first relevant result | Higher MRR means less noise before the answer |
| nDCG | Ranking quality among retrieved chunks | Strong reranking improves nDCG without reindexing |
| Freshness | Recent documents are retrievable | Prevents stale answers in policy and operational domains |
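Precision@k and the "mean" in MRR are simple to compute once per-query results are available. The sketch below averages per-query metrics over a small evaluation set. The document IDs and relevance labels are made up, and the helper names are chosen for this example.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k slots filled with relevant docs."""
    relevant = set(relevant_ids)
    return sum(1 for d in retrieved_ids[:k] if d in relevant) / k

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / position of the first relevant result (0.0 if none is found)."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def mean_over_queries(metric, runs):
    """Average a per-query metric over (retrieved, relevant) pairs."""
    return sum(metric(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    (["d3", "d7", "d1", "d5"], ["d1", "d3"]),  # first hit at rank 1
    (["d9", "d2", "d4", "d8"], ["d4"]),        # first hit at rank 3
]
mrr_value = mean_over_queries(reciprocal_rank, runs)
p_at_4 = mean_over_queries(lambda ret, rel: precision_at_k(ret, rel, 4), runs)
print(f"MRR: {mrr_value:.3f}  Precision@4: {p_at_4:.3f}")
# → MRR: 0.667  Precision@4: 0.375
```

Reporting the mean over a representative query set, rather than a single query, is what makes these numbers comparable across retriever configurations.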
Where Retrieval Quality Is Won or Lost
| Component | Main Question | Typical Failure |
|---|---|---|
| Chunking | What unit should be retrieved? | Chunks too broad or too thin |
| Embeddings / Lexical | Can the system find likely evidence? | Semantic misses or exact-match misses |
| Metadata filters | Is the search in the right slice? | Wrong tenant, wrong date, wrong scope |
| Reranking | Are the best passages near the top? | Useful evidence buried too low |
| Prompt assembly | Does the model see enough clean support? | Context noise overwhelms the answer |
Connecting Retrieval to Generation
Retrieval metrics should not be isolated from generation outcomes. A retriever that looks strong offline but feeds noisy evidence to the generator may still fail the user task. The best evaluation pipeline measures both retrieval quality (recall, nDCG) and end-to-end answer quality (factual accuracy, citation correctness, user satisfaction).
Python Example
import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs found in top-k retrieved."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / max(len(relevant), 1)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank for one query: 1/position of first relevant result.

    Average this value over all evaluation queries to get MRR.
    """
    relevant = set(relevant_ids)
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Normalized Discounted Cumulative Gain at k (binary relevance)."""
    relevant = set(relevant_ids)
    dcg = sum(
        (1.0 if retrieved_ids[i] in relevant else 0.0) / np.log2(i + 2)
        for i in range(min(k, len(retrieved_ids)))
    )
    # Ideal ranking: all relevant docs placed at the top
    ideal = sum(
        1.0 / np.log2(i + 2)
        for i in range(min(k, len(relevant_ids)))
    )
    return dcg / max(ideal, 1e-10)

# Example evaluation
retrieved = ["d3", "d7", "d1", "d5", "d2"]
relevant = ["d1", "d3"]
print(f"Recall@5: {recall_at_k(retrieved, relevant, 5):.2f}")  # 1.00
print(f"MRR: {mrr(retrieved, relevant):.2f}")                  # 1.00
print(f"nDCG@5: {ndcg_at_k(retrieved, relevant, 5):.2f}")      # 0.92