Ch 8: Retrieval Foundations for Large Language Model Systems

Retrieval Fundamentals

What retrieval-augmented generation is, how different retrieval methods work, and the design decisions that shape what evidence reaches the model.

What Is RAG?

RAG (Retrieval-Augmented Generation) is a pattern where the system first retrieves relevant external information, then feeds it to the language model as context for generation. The goal is to improve factual grounding without retraining the model.

💡 RAG is an open-book exam for the model — instead of relying on memorized knowledge, it looks up the relevant pages before answering.

Without RAG

Q: What is our refund policy?

A: "I believe the refund policy is 30 days..." (hallucinated; the actual policy is 14 business days)

With RAG

Q: What is our refund policy?

Retrieved: "Refunds must be requested within 14 business days..."

A: "According to our policy, refunds must be requested within 14 business days."

Why RAG Exists

Not all knowledge should live inside model weights. External retrieval gives the system access to fresher and more auditable evidence. RAG improves controllability as much as it improves accuracy (Lewis et al., 2020):

Freshness: Update the knowledge base without retraining the model
Auditability: Every answer can cite its source document
Domain specificity: Inject proprietary or regulated content the model was never trained on
Cost: Cheaper than fine-tuning for knowledge-heavy use cases

The RAG Pipeline

A standard RAG pipeline has five stages: the user query arrives, it may be rewritten for better retrieval (see Topic 9: Query Rewriting), relevant passages are retrieved from the knowledge base, the passages are reranked (see Topic 8: Reranking), and finally the top passages are assembled into a prompt for the generator.

RAG Is a System, Not a Feature

In interviews, emphasize that RAG is a system design pattern, not a single API call. Quality depends on every component: chunking (Topic 4), retrieval method (Topic 2/Topic 3), metadata filtering (Topic 5), reranking (Topic 8), and evaluation (Topic 10). The same model can look excellent or terrible depending on how the knowledge base is chunked and ranked.

→ RAG separates knowledge from reasoning. The model reasons; the retriever provides the facts. Quality depends on every link in the chain.

Python Example

# Minimal RAG pipeline skeleton
import openai

def simple_rag(query, retriever, client, top_k=5):
    """Retrieve, then generate a grounded answer."""

    # Step 1: Retrieve relevant passages
    passages = retriever.search(query, top_k=top_k)

    # Step 2: Assemble context for the generator
    context = "\n\n".join(
        f"[Source {i+1}]: {p.text}"
        for i, p in enumerate(passages)
    )

    # Step 3: Generate with retrieved context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Answer based ONLY on the
provided context. If the context does not contain
the answer, say so. Cite [Source N] for each claim."""},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

Follow-up Questions

When should you use RAG vs fine-tuning?

Use RAG when the knowledge changes frequently, when auditability matters, or when the knowledge base is large. Use fine-tuning when you need to change the model's behavior, style, or reasoning patterns. Many production systems use both: fine-tuning for tone and format, RAG for factual content.

What are the main failure modes of RAG?

The three main failures are: (1) retrieval miss — the relevant document exists but is not retrieved, (2) ranking failure — the relevant passage is retrieved but buried below noise, and (3) generation failure — the model ignores or contradicts the retrieved evidence. Each has different diagnostic and fix strategies.

Does RAG eliminate hallucination?

No. RAG reduces hallucination by providing evidence, but the model can still ignore the context, misinterpret it, or confabulate details. Strong RAG systems add guardrails: citation requirements, confidence thresholds, and answer-grounding checks that verify each claim against the retrieved passages.
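
The citation requirement mentioned above can be checked mechanically. The following is a minimal sketch, assuming answers cite passages in a "[Source N]" format; the function name check_citations and the naive sentence splitter are illustrative choices, not part of any library. It only verifies that citations are present and point at real passages, not that the cited passage actually supports the claim.

```python
import re

def check_citations(answer, num_sources):
    """Return (uncited_sentences, invalid_citations) for an answer.

    Assumes citations look like "[Source 3]". This is a format check only:
    it does not verify that the cited passage supports the claim.
    """
    # Naive sentence split on ., !, ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited, invalid = [], []
    for sentence in sentences:
        cited = [int(n) for n in re.findall(r"\[Source (\d+)\]", sentence)]
        if not cited:
            uncited.append(sentence)
        # Citation numbers must fall within 1..num_sources
        invalid.extend(n for n in cited if not 1 <= n <= num_sources)
    return uncited, invalid

answer = (
    "Refunds must be requested within 14 business days [Source 1]. "
    "Shipping fees are never refunded. "
    "Gift cards are excluded [Source 7]."
)
uncited, invalid = check_citations(answer, num_sources=3)
print("Uncited sentences:", uncited)
print("Invalid citations:", invalid)
```

A stronger grounding check would compare each claim with the text of the cited passage (for example with an entailment model); this sketch stops at the format level.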

Lexical vs Dense Retrieval

Lexical retrieval matches explicit terms (strong for exact keywords, product codes, error messages). Dense retrieval uses embeddings to find semantically related content even when query and document use different words. Each has blind spots the other covers.

💡 Lexical retrieval is a librarian who finds books by title keywords. Dense retrieval is a librarian who understands what you mean and finds books on the same topic, even if titled differently.

[Figure: side-by-side results for the query "How do I fix error E-4021 when deploying?" under Lexical (BM25) and Dense (Embeddings) retrieval]

Lexical Retrieval (BM25, TF-IDF)

Lexical methods match documents based on term overlap. BM25 is the standard: it rewards documents containing query terms, adjusted for term frequency and document length. Lexical retrieval excels at exact matches — product names, error codes, legal clause numbers, acronyms.
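
To make the scoring concrete, here is a minimal pure-Python sketch of BM25. The parameter values k1=1.5 and b=0.75 are common defaults rather than anything prescribed by this chapter, and the IDF uses the non-negative variant log(1 + (N - n + 0.5) / (n + 0.5)); libraries differ slightly in these details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25.

    k1 controls term-frequency saturation; b controls length normalization.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through b
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error e-4021 deployment pipeline timeout".split(),
    "resolving ci/cd failures in production".split(),
    "configure deployment retry settings".split(),
]
scores = bm25_scores("fix error e-4021".split(), docs)
print(scores)
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. This is the exact-match behavior that makes lexical retrieval reliable for identifiers.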

Dense Retrieval (Bi-encoder)

Dense retrieval encodes both query and document into embedding vectors, then retrieves by vector similarity (typically cosine or dot product). It captures semantic meaning, so "how to fix deployment failures" matches "resolving CI/CD pipeline errors" even without shared keywords.

Trade-offs

Dimension	Lexical	Dense
Exact matches	Excellent	Often misses rare identifiers
Semantic recall	Poor (requires word overlap)	Strong (captures meaning)
Speed	Very fast (inverted index)	Fast with ANN index
Index size	Moderate	Larger (stores vectors)
Zero-shot domains	Works immediately	Needs good embedding model

Enterprise Reality

In enterprise systems, you often need both because users ask conceptually while documents are written operationally. This is why Topic 3: Hybrid Retrieval has become the production standard.

→ Lexical protects exact matches; dense improves semantic recall. Production systems almost always need both.

Python Example

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Error E-4021: deployment pipeline timeout after 300s",
    "Resolving CI/CD failures in production environments",
    "How to configure deployment retry settings",
]

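# --- Lexical retrieval with BM25 ---
import re

def tokenize(text):
    # Lowercase and keep hyphenated identifiers such as "e-4021" intact;
    # a plain split() would produce "e-4021:" and miss the exact match
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

tokenized = [tokenize(d) for d in docs]
bm25 = BM25Okapi(tokenized)
query = "fix error E-4021 deploying"
lex_scores = bm25.get_scores(tokenize(query))
print("BM25 scores:", lex_scores)  # highest for the doc containing "E-4021"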

# --- Dense retrieval with embeddings ---
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = model.encode(docs)
q_emb = model.encode([query])
dense_scores = np.dot(doc_embs, q_emb.T).flatten()
print("Dense scores:", dense_scores)  # high for semantically similar

Follow-up Questions

Is BM25 still relevant in the age of embeddings?

Absolutely. BM25 is a strong first-stage retriever that requires no GPU, no embedding model, and no vector index. It excels at exact-match queries that embeddings often fumble (product IDs, error codes, legal references). In hybrid systems, BM25 frequently catches documents that dense retrieval misses.

What is a bi-encoder vs a cross-encoder?

A bi-encoder encodes query and document independently, enabling precomputation and fast retrieval. A cross-encoder takes the query-document pair as joint input, producing higher-quality relevance scores but at much higher cost. Bi-encoders are for first-stage retrieval; cross-encoders are for reranking (see Topic 8: Reranking).

How do you handle multilingual retrieval?

Use a multilingual embedding model (e.g., multilingual-e5-large) that maps texts from different languages into the same vector space. For lexical retrieval, you need language-specific tokenization and stemming. Hybrid systems often pair language-aware BM25 with multilingual dense retrieval.

Hybrid Retrieval

Hybrid retrieval combines lexical and dense signals so the system benefits from exact terminology and semantic similarity at the same time. Dense retrieval alone may miss rare identifiers; lexical retrieval alone may miss paraphrases. Together they improve first-stage recall.

💡 Hybrid retrieval is like searching with both Google and a library catalog — one finds what you meant, the other finds what you said.

BM25 Results (exact term matches) + Dense Results (semantic matches) → Fused Ranking (best of both)

Why Hybrid Wins

Real enterprise queries contain a mix of exact identifiers and conceptual language. "How do I resolve E-4021 timeout during blue-green deployment?" has an error code (lexical) and a conceptual description (dense). Hybrid search reduces the blind spots of each method.

Fusion Strategies

Strategy	How It Works	Pros/Cons
Reciprocal Rank Fusion (RRF)	Merges result lists by reciprocal rank position	Simple, no tuning needed; ignores score magnitude
Score normalization + weighting	Normalize BM25 and dense scores to [0,1], then combine with weights	Tunable; requires calibration
Learned fusion	Train a small model to combine signals	Best quality; needs training data

Practical Guidance

Start with RRF: it requires no score calibration and is a strong baseline. Tune the BM25/dense weight ratio only after you have evaluation data (see Topic 10: Retrieval Metrics). A 50/50 or 40/60 BM25/dense split is a reasonable starting point, but the right ratio depends on your queries and corpus.
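
For comparison with RRF, the score normalization + weighting strategy from the table can be sketched as follows. This is an illustrative version assuming min-max normalization and a single bm25_weight parameter; the function names are hypothetical, and documents missing from one list are treated as scoring 0 for that signal, which is itself a design choice.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(lexical, dense, bm25_weight=0.5):
    """Combine normalized BM25 and dense scores with a tunable weight.

    Documents missing from one list contribute 0 for that signal.
    """
    lexical_norm = min_max_normalize(lexical)
    dense_norm = min_max_normalize(dense)
    doc_ids = set(lexical_norm) | set(dense_norm)
    fused = {
        doc_id: bm25_weight * lexical_norm.get(doc_id, 0.0)
        + (1 - bm25_weight) * dense_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

bm25 = {"doc_1": 8.2, "doc_3": 6.1, "doc_5": 4.0}
dense = {"doc_3": 0.92, "doc_2": 0.88, "doc_1": 0.85}
fused = weighted_fusion(bm25, dense, bm25_weight=0.5)
print(fused)
```

Unlike RRF, this approach is sensitive to score distributions: min-max normalization stretches each list to [0, 1] regardless of how close the raw scores were, which is one reason it needs calibration against evaluation data.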

→ Hybrid retrieval is the production default for RAG. Start with Reciprocal Rank Fusion and tune from there.

Python Example

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge multiple ranked lists using RRF.

    Args:
        result_lists: list of lists of (doc_id, score) tuples
        k: smoothing constant (default 60)
    Returns:
        sorted list of (doc_id, rrf_score) tuples
    """
    scores = {}
    for results in result_lists:
        for rank, (doc_id, _) in enumerate(results):
            if doc_id not in scores:
                scores[doc_id] = 0.0
            # RRF formula: 1 / (k + rank), with rank counted from 1;
            # enumerate() is 0-based, hence the + 1
            scores[doc_id] += 1.0 / (k + rank + 1)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Usage: fuse BM25 and dense retrieval results
bm25_results = [("doc_1", 8.2), ("doc_3", 6.1), ("doc_5", 4.0)]
dense_results = [("doc_3", 0.92), ("doc_2", 0.88), ("doc_1", 0.85)]

fused = reciprocal_rank_fusion([bm25_results, dense_results])
print("Fused ranking:", fused[:5])

Follow-up Questions

How do you choose the weight between lexical and dense?

Start with equal weights or RRF (which is rank-based, not weight-based). Then tune using a labeled evaluation set where you know which documents are relevant for each query. If your domain has many exact identifiers, increase the lexical weight. If queries are mostly conceptual, favor dense.

Does hybrid retrieval double the infrastructure cost?

Not quite. You need both an inverted index (for BM25) and a vector index (for dense), but the retrieval step is still fast because both searches run in parallel. The main cost increase is storage (maintaining both indexes) and the fusion step, which is negligible. Many vector databases (Weaviate, Qdrant, Vespa) support hybrid search natively.

Can you use more than two retrieval signals?

Yes. Production systems sometimes add signals like recency boost, click-through rate, document authority, or user preference embeddings. RRF naturally supports any number of ranked lists. The challenge is not combining signals but evaluating whether each additional signal actually improves end-to-end quality.

Chunking Strategies

Chunking decides the unit of retrieval. If chunks are too large, retrieval becomes noisy. If chunks are too small, the answer loses surrounding context. Good chunking aligns with the structure of the source material and is both a recall and precision decision.

💡 Chunking is like cutting a pizza — too few slices and each is unwieldy, too many and the toppings fall apart. The right cut follows the natural boundaries.

Why Chunking Dominates Quality

Chunking shapes what the retriever can find and what the generator can understand. A poorly chunked knowledge base will defeat even the best embedding model and reranker. In interviews, say that chunking is the single highest-leverage design decision in most RAG systems.

Chunking Approaches

Method	How It Works	Best For
Fixed token window	Split every N tokens with M overlap	Unstructured text, logs
Sentence boundary	Split at sentence or paragraph breaks	Articles, documentation
Structural (headings)	Split at document headings/sections	Technical docs, wikis
Semantic	Split when embedding similarity drops	Conversations, transcripts
Parent-child	Index small chunks, retrieve parent sections	Long documents needing context
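
As an illustration of the sentence-boundary row, the sketch below packs whole sentences into chunks up to a size limit. It assumes a naive regex sentence splitter and counts words as a stand-in for tokens; a real pipeline would use the embedding model's tokenizer and a proper sentence segmenter.

```python
import re

def chunk_by_sentences(text, max_words=50):
    """Pack whole sentences into chunks of at most max_words words.

    Words stand in for tokens here; a sentence longer than max_words
    becomes its own (oversized) chunk rather than being split.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "Refunds must be requested within 14 business days. "
    "Shipping fees are not refundable. "
    "Contact support to start a return. "
    "International orders may take longer."
)
chunks = chunk_by_sentences(text, max_words=14)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

No sentence is ever cut in half, at the cost of chunks that vary in length.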

Overlap Matters

Overlap between chunks ensures that information at chunk boundaries is not lost. A typical setting is 10-15% overlap (e.g., 60-token overlap on 400-token chunks). Without overlap, a sentence split across two chunks may not be retrievable by either.

Practical Guidelines

200-500 tokens per chunk is a common starting range for general documents
Respect document structure: do not split in the middle of tables, code blocks, or list items
Include metadata: attach the document title, section heading, and source URL to each chunk
Test empirically: the best chunk size depends on your queries and documents — there is no universal answer

→ Chunking is the most underestimated decision in RAG. Get it wrong and nothing downstream can compensate.

Python Example

def chunk_text(tokens, chunk_size=400, overlap=60):
    """Split a token list into overlapping chunks.

    Overlap ensures information at chunk boundaries
    is captured by at least one chunk, improving recall.
    """
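    # Guard: with overlap >= chunk_size the window would never advance
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0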
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        # Step forward by (chunk_size - overlap)
        start = end - overlap
    return chunks

# Example: chunk a document of 1200 tokens
tokens = list(range(1200))  # simulated token IDs
chunks = chunk_text(tokens, chunk_size=400, overlap=60)
print(f"Created {len(chunks)} chunks")
for i, c in enumerate(chunks):
    print(f"  Chunk {i}: tokens {c[0]}-{c[-1]} ({len(c)} tokens)")
# Output: chunks with 60-token overlap at boundaries

Follow-up Questions

What is parent-child chunking?

Index small chunks (e.g., 200 tokens) for precise retrieval, but when a small chunk is retrieved, return its parent chunk (e.g., the full 800-token section) to the generator. This gives you precise retrieval with rich context. LlamaIndex and LangChain both support this pattern.
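
The lookup step can be sketched without any framework. The example below assumes you keep a child-to-parent mapping next to the index; the identifiers and the retrieve_parents function are hypothetical and do not reflect the LlamaIndex or LangChain APIs.

```python
def retrieve_parents(hit_child_ids, child_to_parent, parents, max_parents=3):
    """Map retrieved child chunks to their parent sections, deduplicated.

    hit_child_ids is assumed to be ordered best-first; a parent keeps the
    rank of its best-scoring child.
    """
    seen, results = set(), []
    for child_id in hit_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # several children can point at the same parent
        seen.add(parent_id)
        results.append(parents[parent_id])
        if len(results) == max_parents:
            break
    return results

parents = {
    "sec-refunds": "(full text of the refunds section)",
    "sec-shipping": "(full text of the shipping section)",
}
child_to_parent = {"c1": "sec-refunds", "c2": "sec-refunds", "c3": "sec-shipping"}

# Child hits from a hypothetical small-chunk index, best first
context = retrieve_parents(["c2", "c1", "c3"], child_to_parent, parents)
print(context)
```

Deduplication matters here: without it, two neighboring child hits would send the same parent section to the generator twice and waste context budget.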

How do you chunk tables and code blocks?

Never split a table or code block across chunks — it destroys the information. Treat each table/code block as an atomic unit. If it exceeds your chunk size, either increase the limit for that chunk or summarize the table into text and index the summary alongside the full table.

Should chunk size match the embedding model's max input?

No. Many embedding models cap input at around 512 tokens (some newer models accept far more), but that does not mean your chunks should be as long as the limit allows. Shorter chunks (200-400 tokens) often produce better retrieval precision because the embedding more accurately represents a single focused topic rather than a mixture of topics.

Metadata Filters

Metadata filters narrow the search space using structured attributes — product, region, date, language, permission scope, or document type. This helps retrieve from the right neighborhood before the system even ranks semantic relevance. Metadata filtering is often the highest-return improvement in production search.

💡 Metadata filters are like choosing the right library section before searching the shelves — you would not look for cooking recipes in the engineering aisle.

Without Filters

Query: "What is the return policy?"

Result 1: Return policy for Product X (Region: EU) [WRONG region]

Result 2: Return policy v2021 (outdated) [WRONG version]

Result 3: Return policy for Product Y (Region: US, current) [CORRECT but ranked 3rd]

With Filters

Query: "What is the return policy?"

Filters: region=US, product=Y, status=current

Result 1: Return policy for Product Y (Region: US, current) [CORRECT and ranked 1st]

Why Filters Beat Better Embeddings

Retrieval quality is not only about better embeddings. Structured constraints can do a large amount of work cheaply and reliably. A filter that restricts search to the correct tenant, date range, or product eliminates entire categories of irrelevant results before semantic scoring even begins.

Common Filter Dimensions

Filter	What It Constrains	Example
Tenant / Organization	Multi-tenant isolation	Only search Company A's docs
Date range	Temporal scope	Only docs updated in last 90 days
Product / Category	Domain scope	Only docs about "Enterprise Plan"
Language	Linguistic scope	Only English-language documents
Permission / ACL	Access control	Only docs the user can see
Document type	Format filtering	Only FAQ pages, not blog posts

Implementation

Most vector databases support pre-filtering (apply filter before vector search) and post-filtering (apply filter after vector search). Pre-filtering is generally preferred because it reduces the search space and ensures you get the requested number of results. Post-filtering can return fewer results than requested if many are filtered out.
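
The difference is easiest to see on a toy example. The sketch below uses a hand-made corpus with precomputed similarity scores (all values are invented for illustration) and compares the two orders of operation for a region=US filter.

```python
def top_k(docs, k):
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:k]

# Toy corpus with precomputed similarity scores (illustrative values)
corpus = [
    {"id": "eu-1", "region": "EU", "score": 0.95},
    {"id": "us-1", "region": "US", "score": 0.94},
    {"id": "eu-2", "region": "EU", "score": 0.93},
    {"id": "eu-3", "region": "EU", "score": 0.90},
    {"id": "us-2", "region": "US", "score": 0.70},
    {"id": "us-3", "region": "US", "score": 0.65},
]
k = 3

# Pre-filter: restrict to region=US first, then take the top k
pre = top_k([d for d in corpus if d["region"] == "US"], k)

# Post-filter: take the top k overall, then drop non-US results
post = [d for d in top_k(corpus, k) if d["region"] == "US"]

pre_ids = [d["id"] for d in pre]
post_ids = [d["id"] for d in post]
print("pre-filter :", pre_ids)
print("post-filter:", post_ids)
```

Pre-filtering returns the requested three US documents, while post-filtering returns only the one US document that happened to be in the global top three. Real engines often mitigate this by over-fetching before a post-filter.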

→ Metadata filters are cheap, reliable, and often deliver more quality improvement than upgrading the embedding model. Always design your index with filter dimensions in mind.

Python Example

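# Metadata-filtered search with a vector database (Qdrant example)
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange

client = QdrantClient("localhost", port=6333)

# query_embedding is assumed to be the query vector produced by the same
# embedding model that was used to index the collection.
# Note: recent qdrant-client releases deprecate search() in favor of
# query_points(); check the version you have installed.

# Search with metadata filters: right tenant, recent docs only
results = client.search(
    collection_name="knowledge_base",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            # Only this tenant's documents
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value="acme-corp")
            ),
            # Only documents updated on or after a cutoff date
            # (compute "90 days ago" at query time in real code;
            # assumes updated_at is stored as an RFC 3339 datetime)
            FieldCondition(
                key="updated_at",
                range=DatetimeRange(gte="2025-02-01T00:00:00Z")
            ),
            # Only English-language content
            FieldCondition(
                key="language",
                match=MatchValue(value="en")
            ),
        ]
    ),
    limit=10,
)
# The filter is applied as part of the vector search, so every returned
# result satisfies all three constraints; fewer than 10 results come
# back only if fewer documents match the filter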

Follow-up Questions

Should you pre-filter or post-filter?

Pre-filter when the filter is selective (it excludes most of the corpus, as a tenant or permission filter usually does) or when you need exactly k results. Post-filter only when the filter is loose enough that most top candidates pass it; with a selective filter, post-filtering can leave too few results. Most production systems default to pre-filtering.

How do you handle permission-based filtering in RAG?

Store access control lists (ACLs) as metadata on each chunk. At query time, filter by the user's permissions before vector search. This ensures the model never sees documents the user should not access. Be careful with caching — cached results must respect per-user permissions.
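
A minimal sketch of both points, assuming each chunk carries a set of group names as its ACL. The field name acl, the group names, and both helper functions are hypothetical. In a real system the permission check would be pushed into the vector database as a pre-filter rather than applied in application code.

```python
def allowed_chunks(chunks, user_groups):
    """Keep only chunks whose ACL shares at least one group with the user."""
    return [c for c in chunks if c["acl"] & user_groups]

def cache_key(query, user_groups):
    """Include the permission set in the key so users with different
    access never share a cached result."""
    return (query.strip().lower(), tuple(sorted(user_groups)))

chunks = [
    {"id": "hr-1", "acl": {"hr"}},
    {"id": "eng-1", "acl": {"engineering", "hr"}},
    {"id": "fin-1", "acl": {"finance"}},
]
visible = allowed_chunks(chunks, user_groups={"engineering"})
visible_ids = [c["id"] for c in visible]
print(visible_ids)

# Same query, different permissions: the cache keys must differ
key_eng = cache_key("What is the bonus policy?", {"engineering"})
key_hr = cache_key("What is the bonus policy?", {"hr"})
print(key_eng != key_hr)
```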

Can metadata filters replace better embeddings?

They complement each other. Filters eliminate structurally wrong results (wrong tenant, wrong date). Embeddings handle semantically wrong results (irrelevant topic). You need both. However, fixing missing or incorrect metadata often yields faster quality improvements than upgrading the embedding model.

Infrastructure & Evaluation

The infrastructure that makes retrieval fast and scalable, plus the techniques that improve ranking quality and the metrics that measure it.

Vector Databases

A vector database stores embeddings and supports efficient nearest-neighbor search at scale. It makes retrieval feasible and fast, but relevance still depends on the embedding model, chunking strategy, and ranking logic layered above it. The vector store is infrastructure, not intelligence.

💡 A vector database is a warehouse with a smart forklift — it stores and fetches boxes efficiently, but it does not decide which box contains what you need.

Store: Embeddings + metadata indexed for fast access

Search: ANN search with metadata filters in milliseconds

Operate: Replication, durability, monitoring, scaling

What Vector Databases Provide

Efficient similarity search: Find the most similar vectors to a query vector at scale using ANN indexes (see Topic 7: Approximate Nearest Neighbor)
Metadata filtering: Combine vector search with structured filters (see Topic 5: Metadata Filters)
CRUD operations: Add, update, delete vectors without rebuilding the entire index
Operational features: Replication, backups, access control, monitoring

Popular Options

Database	Type	Key Strength
Pinecone	Managed SaaS	Zero-ops, automatic scaling
Weaviate	Open source	Built-in hybrid search, multi-modal
Qdrant	Open source	Filtering performance, Rust-based speed
Milvus	Open source	Scale to billions of vectors
pgvector	PostgreSQL extension	Use existing Postgres infrastructure
Chroma	Open source	Developer-friendly, great for prototyping

The Important Interview Point

In interviews, make this point explicitly. Saying "we used Pinecone" does not explain why your RAG system produces good answers any more than saying "we used PostgreSQL" explains why your web app has good UX. What matters is how the embedding model, chunking strategy, and ranking logic above the store were chosen and evaluated.

→ Vector databases are essential infrastructure, but they are not the source of retrieval quality. Relevance comes from embeddings, chunking, and ranking.

Python Example

# Example: indexing and searching with Chroma (lightweight)
import chromadb

# Create a local collection
client = chromadb.Client()
collection = client.create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}  # cosine similarity
)

# Add documents with metadata; no embeddings are passed, so Chroma
# embeds the text with its default embedding function
collection.add(
    documents=[
        "Refunds must be requested within 14 days.",
        "Shipping is free for orders over $50.",
        "Contact support at support@example.com.",
    ],
    metadatas=[
        {"category": "refunds", "region": "US"},
        {"category": "shipping", "region": "US"},
        {"category": "support", "region": "global"},
    ],
    ids=["doc1", "doc2", "doc3"],
)

# Search with optional metadata filter
results = collection.query(
    query_texts=["How do I get a refund?"],
    n_results=2,
    where={"region": "US"},  # metadata filter
)
print(results["documents"])

Follow-up Questions

When should you use pgvector instead of a dedicated vector database?

Use pgvector when you have fewer than ~1M vectors, already use PostgreSQL, and want to avoid adding a new service to your infrastructure. For larger scale, higher QPS, or advanced features (multi-vector queries, GPU-accelerated search), a dedicated vector database is usually worth the operational overhead.

How do you handle embedding model upgrades?

When you upgrade the embedding model, you must re-embed and re-index the entire corpus. Vectors from different models live in different spaces and cannot be mixed. Plan for this by building a re-indexing pipeline that can run in parallel with the live index, then swap atomically.
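
The build-in-parallel-then-swap pattern can be illustrated with a toy in-memory registry. Everything here is a stand-in: the IndexRegistry class is hypothetical, and the two embed functions are fake models that simply produce vectors of different dimensions. Several vector databases offer collection aliases that serve the same purpose as the swap below.

```python
class IndexRegistry:
    """Toy registry: an alias points at whichever index version is live."""

    def __init__(self):
        self.indexes = {}   # version name -> {doc_id: vector}
        self.live = None    # version currently serving queries

    def build(self, version, docs, embed):
        # Re-embed the whole corpus with the given model into a new index
        self.indexes[version] = {doc_id: embed(text) for doc_id, text in docs.items()}

    def swap(self, version):
        # Single assignment: queries see the old index or the new, never a mix
        if version not in self.indexes:
            raise KeyError(f"index {version!r} has not been built")
        self.live = version

    def live_index(self):
        return self.indexes[self.live]

def embed_v1(text):
    return [float(len(text))]  # stand-in for the old model (1 dimension)

def embed_v2(text):
    return [float(len(text)), float(text.count(" "))]  # new model (2 dimensions)

docs = {"d1": "refund policy", "d2": "shipping rates for EU"}
registry = IndexRegistry()
registry.build("v1", docs, embed_v1)
registry.swap("v1")
registry.build("v2", docs, embed_v2)  # built while v1 keeps serving
registry.swap("v2")
print(registry.live, len(registry.live_index()["d1"]))
```

Queries must be embedded with the model that matches the live index, so the query-side model has to switch at the same moment as the alias.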

What about just using FAISS without a database?

FAISS is a vector search library, not a database. It provides fast similarity search but lacks persistence, CRUD operations, metadata filtering, and operational features. FAISS is excellent for prototyping and for embedding into applications, but production systems usually need the durability and management features of a proper database.

Approximate Nearest Neighbor

Exact nearest-neighbor search is too slow for large indexes. Approximate methods (HNSW, IVF, ScaNN) trade a small amount of recall for dramatically better speed and scalability. The question is not whether approximation is philosophically pure, but whether it preserves enough relevance at production speed.

💡 ANN is like checking the top few shelves in the right section of the library instead of scanning every book in the building. You might miss a rare find, but you will answer 1000x faster.

Metric	Exact Search (Brute Force)	ANN Search (HNSW)
Recall	100%	95-99%
Latency (1M vectors)	~500ms	~2ms
Latency (100M vectors)	~50,000ms	~10ms
Scalability	Linear O(n)	O(log n)

Why ANN Exists

Exact nearest-neighbor search compares the query against every vector in the index. For a million vectors, that is a million dot products per query. For a hundred million vectors, it becomes impractical for real-time serving. ANN indexes structure the search space so that most vectors can be skipped.
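
For reference, this is what the brute-force baseline looks like in plain Python: one dot product per stored vector, then a top-k selection. It is a sketch for intuition only (the function name is illustrative); real systems use vectorized libraries, but the linear cost is the same.

```python
import heapq

def exact_top_k(query, vectors, k):
    """Brute-force search: score every vector, keep the k best by dot product."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # One dot product per stored vector: cost grows linearly with index size
    return [idx for _, idx in heapq.nlargest(k, scored)]

vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
nearest = exact_top_k([1.0, 0.0], vectors, k=2)
print(nearest)
```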

Popular ANN Algorithms

Algorithm	Approach	Trade-off
HNSW	Hierarchical navigable small world graph	High recall, higher memory
IVF	Inverted file with cluster-based partitioning	Lower memory, tunable recall/speed
ScaNN	Quantization + anisotropic scoring	Very fast, Google-optimized
Product Quantization	Compress vectors, search in compressed space	Smallest memory, some accuracy loss

Tuning the Trade-off

Every ANN index has knobs that control the recall-vs-speed trade-off. For HNSW, the key parameters are ef_construction (build quality) and ef_search (query quality). Higher values improve recall but increase latency. The practical approach is to measure recall@k on a held-out set and tune until you hit your target (typically 95-99% recall).
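
The tuning loop described above reduces to two small functions: one that measures recall@k against exact results, and one that picks the cheapest setting that meets the target. The sketch below assumes you have already collected ANN results at each ef_search value; the result lists here are made up for illustration.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Average fraction of the exact top-k that the ANN index also returned.

    exact_ids / approx_ids: one ranked list of document ids per query.
    """
    per_query = [
        len(set(exact[:k]) & set(approx[:k])) / k
        for exact, approx in zip(exact_ids, approx_ids)
    ]
    return sum(per_query) / len(per_query)

def pick_ef_search(results_by_ef, exact_ids, k, target):
    """Return the smallest ef_search whose recall@k meets the target, or None."""
    for ef in sorted(results_by_ef):
        if recall_at_k(exact_ids, results_by_ef[ef], k) >= target:
            return ef
    return None

exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
# Hypothetical ANN results collected at two ef_search settings
results_by_ef = {
    16: [[1, 2, 9, 4], [5, 6, 7, 0]],  # misses one true neighbor per query
    64: [[1, 2, 3, 4], [5, 6, 8, 7]],  # same sets; order is ignored by recall
}
chosen = pick_ef_search(results_by_ef, exact, k=4, target=0.95)
print(chosen)
```

Picking the smallest setting that meets the target keeps latency as low as the recall requirement allows.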

→ ANN makes vector search practical at scale. A small recall trade-off (1-5%) buys orders-of-magnitude speed improvement.

Python Example

import faiss
import numpy as np

# Generate 100K random vectors (simulating embeddings)
d = 384                       # embedding dimension
n = 100_000                   # number of documents
vectors = np.random.randn(n, d).astype('float32')

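# --- Exact search (brute force) ---
# L2 distance, the same metric IndexHNSWFlat uses by default, so the
# recall comparison below measures the same notion of "nearest"
exact_index = faiss.IndexFlatL2(d)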
exact_index.add(vectors)

# --- ANN search (HNSW) ---
hnsw_index = faiss.IndexHNSWFlat(d, 32)  # 32 neighbors per node
hnsw_index.hnsw.efConstruction = 200      # build quality
hnsw_index.hnsw.efSearch = 64             # query quality (tune this)
hnsw_index.add(vectors)

# Compare: query with 5 random vectors
queries = np.random.randn(5, d).astype('float32')
import time

t0 = time.time()
D_exact, I_exact = exact_index.search(queries, 10)
print(f"Exact:  {(time.time()-t0)*1000:.1f}ms")

t0 = time.time()
D_ann, I_ann = hnsw_index.search(queries, 10)
print(f"HNSW:   {(time.time()-t0)*1000:.1f}ms")

# Measure recall: how many ANN results match exact results?
recall = np.mean([
    len(set(I_exact[i]) & set(I_ann[i])) / 10.0
    for i in range(5)
])
print(f"Recall@10: {recall:.1%}")

Follow-up Questions

How do you choose between HNSW and IVF?

Use HNSW when you need high recall and can afford the memory (stores full vectors plus the graph). Use IVF (often with product quantization) when memory is constrained or the index is very large (100M+ vectors). HNSW is the default choice for most production RAG systems under 50M vectors.

Does ANN recall loss actually affect RAG quality?

In most cases, no. A 97% recall@10 means you miss 0.3 of the true top-10 results on average. Since RAG systems typically retrieve 5-20 chunks and the reranker (see Topic 8: Reranking) further filters them, a small recall loss in the first stage rarely affects the final answer quality.

What about GPU-accelerated vector search?

Libraries like FAISS-GPU and cuVS (NVIDIA) can search very large indexes at high throughput and low latency. GPU search is valuable for very high QPS (thousands of queries per second) or very large indexes. For most RAG applications, CPU-based HNSW is fast enough and simpler to operate.

Reranking

Reranking applies a more expensive relevance model to a shortlist returned by the first retriever. The initial retriever maximizes speed and recall; the reranker improves ordering so the best evidence reaches the generator. The pattern is bi-encoder retrieval followed by cross-encoder reranking.

💡 Retrieval is casting a wide net; reranking is picking the best fish from the catch. The net must be wide (high recall), but the chef only needs the best ones (high precision).

Two-Stage Architecture

The standard pattern is:

Stage 1 (Retriever): Bi-encoder retrieves top-100 candidates quickly using precomputed embeddings
Stage 2 (Reranker): Cross-encoder scores each candidate against the query with full attention, then returns the top-5 or top-10

This gives you the scalability of vector search and the precision of richer query-document interaction. The cross-encoder sees the query and document together, enabling deeper relevance assessment than independent embeddings can provide.

Why Reranking Works

Property	Bi-Encoder (Retriever)	Cross-Encoder (Reranker)
Input	Query and doc encoded separately	Query + doc as one input pair
Interaction	Dot product of independent vectors	Full attention between query and doc tokens
Speed	~1ms per query (precomputed)	~50ms per query-doc pair
Quality	Good recall, approximate relevance	Higher precision, fine-grained relevance
Scalability	Millions of documents	10-100 candidates per query

In Practice

In interviews, explain reranking as a second-stage quality filter. It is one of the highest-impact improvements you can add to a RAG system, often improving answer quality by 10-20% without changing the index or embedding model. Popular rerankers include Cohere Rerank, BGE-reranker, and cross-encoder models from Hugging Face.

→ Reranking is cheap insurance for RAG quality. It improves precision without reindexing, making it one of the highest-return improvements.

Python Example

from sentence_transformers import CrossEncoder

# Load a cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the refund policy for international orders?"

# Candidates from first-stage retrieval (bi-encoder)
candidates = [
    "Shipping costs are non-refundable for all orders.",
    "International refunds take 10-15 business days.",
    "Our refund policy allows returns within 14 days.",
    "Contact support for order tracking information.",
    "International orders may incur customs duties.",
]

# Score each candidate against the query
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Rerank by cross-encoder score
ranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True
)
for doc, score in ranked:
    print(f"  {score:>6.3f}  {doc}")
# Expect "International refunds take 10-15 business days." at or near the top

Follow-up Questions

How many candidates should you rerank?

Typical range is 20-100 candidates. Too few and you risk missing relevant documents. Too many and reranking latency becomes noticeable (cross-encoders process ~20 pairs per second). A common setup: retrieve top-50 with bi-encoder, rerank to get top-5 for the generator.

Can you use an LLM as a reranker?

Yes. LLM-based reranking (e.g., prompting GPT-4 to rank passages) can outperform cross-encoders on complex queries. However, it is 10-100x more expensive and slower. Use it when quality justifies the cost, or for offline evaluation to create training data for a lighter reranker.

Does reranking help when the first-stage retrieval is already good?

Almost always, yes. Even a strong bi-encoder retriever can rank a tangentially relevant document above a highly relevant one. Reranking consistently improves nDCG and MRR (see Topic 10: Retrieval Metrics), which directly translates to better evidence reaching the generator.
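
Both metrics are short enough to compute by hand. The sketch below is one common formulation (MRR over the first relevant hit, nDCG@k with linear gains); other variants exist, for example nDCG with exponential gains, and the toy rankings are invented for illustration.

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant document per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, relevance, k):
    """nDCG@k with graded relevance given as {doc_id: gain} (linear gain)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0

before = ["d3", "d1", "d2"]  # first-stage order
after = ["d1", "d2", "d3"]   # order after reranking
relevance = {"d1": 3, "d2": 1}

print("MRR :", mrr([before], [{"d1"}]), "->", mrr([after], [{"d1"}]))
print("nDCG:", ndcg_at_k(before, relevance, 3), "->", ndcg_at_k(after, relevance, 3))
```

In this toy case the reranker moves the most relevant document from rank 2 to rank 1, which raises both metrics.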

Query Rewriting

Query rewriting converts the user's raw question into a form better aligned with the indexed content. The system may expand acronyms, normalize jargon, add keywords, disambiguate entities, or split one complex query into multiple focused retrieval intents. It is often one of the cheapest ways to improve recall.

💡 Query rewriting is a translator between user language and index language — users say "the app is wonky" but the docs say "application performance degradation."

Original Query

"how do I fix the SSO thing when it breaks on mobile?"

↓ LLM Rewriter ↓

Expanded

"Troubleshoot Single Sign-On (SSO) authentication failures on mobile devices"

Keywords

"SSO mobile error SAML OAuth redirect loop iOS Android"

Decomposed

Q1: "SSO configuration for mobile apps"
Q2: "Common SSO error codes and fixes"

Why Users Write Bad Queries

Users do not naturally speak in index-friendly language. They use colloquialisms ("the SSO thing"), abbreviations, vague references ("it broke"), and compound questions that combine multiple intents. The retriever must bridge this gap.

Rewriting Strategies

Strategy	What It Does	When to Use
Acronym expansion	SSO → Single Sign-On	Technical domains with heavy jargon
Keyword injection	Add related terms for BM25	Hybrid retrieval systems
Query decomposition	Split complex query into sub-queries	Multi-part questions
Hypothetical answer	Generate what a good answer looks like, use it as query (HyDE)	Abstract or conceptual queries
Conversation context	Inject context from conversation history	Multi-turn chat RAG
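Two of the strategies in the table, acronym expansion and keyword injection, can be sketched without an LLM. The lookup tables `ACRONYMS` and `RELATED_TERMS` below are hypothetical; a real system would presumably maintain them per domain or derive them from the corpus.

```python
import re
from typing import Dict, List

# Hypothetical lookup tables, for illustration only.
ACRONYMS: Dict[str, str] = {
    "SSO": "Single Sign-On",
    "MFA": "Multi-Factor Authentication",
}
RELATED_TERMS: Dict[str, List[str]] = {"SSO": ["SAML", "OAuth", "login"]}

def expand_acronyms(query: str, acronyms: Dict[str, str] = ACRONYMS) -> str:
    """Put the long form in front of each known acronym, keeping the short form."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        long_form = acronyms.get(token.upper())
        return f"{long_form} ({token})" if long_form else token
    # Only short alphabetic tokens are checked against the table.
    return re.sub(r"\b[A-Za-z]{2,5}\b", replace, query)

def inject_keywords(query: str, related: Dict[str, List[str]] = RELATED_TERMS) -> str:
    """Append related terms for a lexical (BM25) index, skipping ones already present."""
    present = {w.lower() for w in re.findall(r"\w+", query)}
    extra: List[str] = []
    for trigger, terms in related.items():
        if trigger.lower() not in present:
            continue
        for term in terms:
            if term.lower() not in present and term not in extra:
                extra.append(term)
    return f"{query} {' '.join(extra)}" if extra else query

print(expand_acronyms("fix the SSO thing on mobile"))
print(inject_keywords("SSO login fails"))
```

Keeping the original acronym next to its expansion is a deliberate choice in this sketch: documents may use either form, so a lexical index can match on both.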

Cost vs Impact

Query rewriting changes only the query, so it can improve recall without re-embedding or re-indexing the corpus. A single rewrite call to a small LLM typically costs a fraction of a cent and can substantially improve first-stage retrieval. In interviews, mention query rewriting as a high-leverage, low-cost intervention.

→ Query rewriting bridges the gap between user language and index language. It is cheap, fast, and often the highest-ROI retrieval improvement.

Python Example

from openai import OpenAI  # pass client=OpenAI(); it reads OPENAI_API_KEY from the environment

def rewrite_query(user_query, client, chat_history=None):
    """Rewrite a user query for better retrieval."""
    context = ""
    if chat_history:
        context = "\n".join(
            f"{m['role']}: {m['content']}"
            for m in chat_history[-3:]
        )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap and fast for rewrites
        messages=[{
            "role": "system",
            "content": """Rewrite the user's question to improve
document retrieval. Expand acronyms, add relevant
keywords, and resolve ambiguous references using
conversation context if provided. Output ONLY the
rewritten query, nothing else."""
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nOriginal: {user_query}"
        }],
        temperature=0.0,
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()

# Example: "fix the SSO thing on mobile"
# -> "Troubleshoot Single Sign-On authentication failures on mobile"

Follow-up Questions

What is HyDE (Hypothetical Document Embeddings)?

HyDE asks an LLM to generate a hypothetical answer to the query, then uses that answer as the retrieval query. The intuition is that the hypothetical answer is closer in embedding space to real relevant documents than the original question is. It works well for abstract queries but adds latency and cost.
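The HyDE idea can be illustrated with toy components. In this sketch the bag-of-words `embed` function stands in for a dense encoder and `fake_llm` stands in for the model that writes the hypothetical answer. Both are assumptions chosen so the example runs without any external service.

```python
import math
from collections import Counter
from typing import Callable, List

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a dense encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def search(text: str, docs: List[str], k: int = 1) -> List[str]:
    """Rank documents by similarity to the given text."""
    vec = embed(text)
    return sorted(docs, key=lambda d: cosine(vec, embed(d)), reverse=True)[:k]

def hyde_search(
    query: str, docs: List[str], generate: Callable[[str], str], k: int = 1
) -> List[str]:
    """Search with a generated hypothetical answer instead of the raw query."""
    return search(generate(query), docs, k)

def fake_llm(query: str) -> str:
    # Placeholder for an LLM call that drafts a plausible answer.
    return "refunds must be requested within 14 business days of purchase"

docs = [
    "refunds must be requested within 14 business days",
    "can i track my order online",
]
query = "can i get my money back"
plain = search(query, docs)               # matches on surface words of the question
hyde = hyde_search(query, docs, fake_llm)  # matches on answer-like wording
```

In this toy setup the raw question shares words only with the unrelated tracking document, while the hypothetical answer shares wording with the refund passage, which is the effect HyDE relies on.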

Does query rewriting add too much latency?

A fast LLM (GPT-4o-mini, Claude Haiku) can rewrite a query in 100-300ms. For most RAG applications, this is acceptable since the overall pipeline (retrieval + reranking + generation) takes 1-3 seconds. For ultra-low-latency needs, pre-compute rewrites or use a fine-tuned small model.

How do you handle multi-turn conversations in RAG?

The most important technique is query contextualization: rewrite the current query to include relevant context from the conversation history. "What about the mobile version?" becomes "What are the SSO configuration options for the mobile version of the app?" This is essentially query rewriting with conversation history as input.

Retrieval Metrics

Recall@k measures whether relevant evidence appears in the shortlist. MRR and nDCG measure whether relevant items appear near the top. The strongest interview answer is that retrieval metrics should not be isolated from generation outcomes — a retriever that looks strong offline may still fail the user task.

💡 Retrieval metrics are like a medical checkup for your RAG system. Recall checks if the right evidence was found. Ranking metrics check if it was prioritized correctly.

Retrieval Scorecard

Metric	What It Checks	Why It Matters
Recall@k	Relevant evidence appears in the candidate set	Low recall means the generator never sees the right facts
Precision@k	Returned context is mostly useful	High noise wastes context window and increases hallucination risk
MRR	Position of the first relevant result	Higher MRR means less noise before the answer
nDCG	Ranking quality among retrieved chunks	Strong reranking improves nDCG without reindexing
Freshness	Recent documents are retrievable	Prevents stale answers in policy and operational domains
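The Precision@k row of the scorecard can be sketched as a small function. The choice to divide by the number of results actually returned, rather than by k, is an assumption of this sketch for the case where fewer than k results come back.

```python
from typing import Sequence, Set

def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved results that are relevant."""
    top_k = list(retrieved_ids[:k])
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Assumption: normalise by the results actually returned, not by k.
    return hits / len(top_k)

retrieved = ["d3", "d7", "d1", "d5", "d2"]
relevant = {"d1", "d3"}

print(f"Precision@5: {precision_at_k(retrieved, relevant, 5):.2f}")
print(f"Precision@1: {precision_at_k(retrieved, relevant, 1):.2f}")
```

Here both relevant documents are found within the top 5, yet three of the five slots carry noise, which is the gap between recall and precision that the scorecard separates.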

Where Retrieval Quality Is Won or Lost

Component	Main Question	Typical Failure
Chunking	What unit should be retrieved?	Chunks too broad or too thin
Embeddings / Lexical	Can the system find likely evidence?	Semantic misses or exact-match misses
Metadata filters	Is the search in the right slice?	Wrong tenant, wrong date, wrong scope
Reranking	Are the best passages near the top?	Useful evidence buried too low
Prompt assembly	Does the model see enough clean support?	Context noise overwhelms the answer

Connecting Retrieval to Generation

Retrieval metrics should not be isolated from generation outcomes. A retriever that looks strong offline but feeds noisy evidence to the generator may still fail the user task. The best evaluation pipeline measures both retrieval quality (recall, nDCG) and end-to-end answer quality (factual accuracy, citation correctness, user satisfaction).
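A joint evaluation can be sketched by recording one retrieval score and one answer judgment per query, then bucketing failures by where they happened. The `EvalRecord` fields, the bucket names, and the 0.5 recall threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EvalRecord:
    query: str
    recall_at_k: float    # retrieval metric for this query
    answer_correct: bool  # end-to-end judgment (human or LLM-graded)

def diagnose(records: List[EvalRecord], recall_threshold: float = 0.5) -> Dict[str, List[str]]:
    """Bucket each query by where the failure happened."""
    buckets: Dict[str, List[str]] = {
        "ok": [], "retrieval_miss": [], "generation_miss": [],
    }
    for r in records:
        if r.answer_correct:
            buckets["ok"].append(r.query)
        elif r.recall_at_k < recall_threshold:
            # The evidence never reached the generator.
            buckets["retrieval_miss"].append(r.query)
        else:
            # The evidence was retrieved, but the answer was still wrong.
            buckets["generation_miss"].append(r.query)
    return buckets

records = [
    EvalRecord("refund window", recall_at_k=1.0, answer_correct=True),
    EvalRecord("sso on mobile", recall_at_k=0.0, answer_correct=False),
    EvalRecord("customs duties", recall_at_k=1.0, answer_correct=False),
]
report = diagnose(records)
```

The `generation_miss` bucket is the case the text warns about: retrieval looks fine offline, but the user task still fails, so the fix lies in ranking, prompt assembly, or the generator rather than in recall.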

→ Measure retrieval and generation together. A retriever with perfect recall but poor ranking can still produce poor answers.

Python Example

import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs found in top-k retrieved."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / max(len(relevant), 1)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank for one query: 1/position of the first relevant result.

    Average this value over all evaluation queries to get MRR.
    """
    relevant = set(relevant_ids)
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Normalized Discounted Cumulative Gain at k."""
    relevant = set(relevant_ids)
    dcg = sum(
        (1.0 if retrieved_ids[i] in relevant else 0.0) / np.log2(i + 2)
        for i in range(min(k, len(retrieved_ids)))
    )
    ideal = sum(
        1.0 / np.log2(i + 2)
        for i in range(min(k, len(relevant)))  # count unique relevant ids only
    )
    return dcg / max(ideal, 1e-10)

# Example evaluation
retrieved = ["d3", "d7", "d1", "d5", "d2"]
relevant = ["d1", "d3"]

print(f"Recall@5: {recall_at_k(retrieved, relevant, 5):.2f}")
print(f"MRR:      {mrr(retrieved, relevant):.2f}")
print(f"nDCG@5:   {ndcg_at_k(retrieved, relevant, 5):.2f}")

Follow-up Questions

How do you build an evaluation dataset for retrieval?

Start with real user queries from logs. Have annotators mark which documents are relevant for each query. Start small (50-100 queries with 5-10 relevance judgments each) and grow over time. LLMs can help generate candidate relevance labels, but human verification is essential for the final dataset.
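A labelled evaluation set can be as simple as a mapping from each query to the document ids annotators judged relevant. The sketch below assumes that shape and shows how a per-query score is averaged over the set to give MRR. The queries, ids, and the `fake_search` stand-in for a real retriever are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical labelled set: query -> ids judged relevant by annotators.
eval_set: Dict[str, List[str]] = {
    "refund policy international": ["d1", "d3"],
    "reset sso on mobile": ["d9"],
}

def reciprocal_rank(retrieved: List[str], relevant: List[str]) -> float:
    """1/position of the first relevant result for one query, else 0."""
    relevant_set = set(relevant)
    for position, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(
    labelled: Dict[str, List[str]], search: Callable[[str], List[str]]
) -> float:
    """Average the per-query reciprocal rank over the whole evaluation set."""
    scores = [reciprocal_rank(search(q), rel) for q, rel in labelled.items()]
    return sum(scores) / len(scores) if scores else 0.0

def fake_search(query: str) -> List[str]:
    # Stand-in for a real retriever: canned rankings per query.
    canned = {
        "refund policy international": ["d3", "d7", "d1"],
        "reset sso on mobile": ["d4", "d9"],
    }
    return canned.get(query, [])

print(f"MRR: {mean_reciprocal_rank(eval_set, fake_search):.2f}")
```

Passing the retriever in as a function keeps the labelled set reusable: the same queries and judgments can score a new embedding model or chunking scheme without changing the evaluation code.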

What is a good Recall@10 target for RAG?

There is no universal threshold, but as a rule of thumb, Recall@10 above 0.85 is a reasonable target for many RAG applications. Below about 0.7, expect frequent "I don't have enough information" responses or hallucinations. Above 0.95, recall is rarely the bottleneck, so focus on ranking quality (nDCG, MRR) instead.

How do you detect retrieval regressions in production?

Build a regression test suite of critical queries with known relevant documents. Run this suite automatically after every index update, embedding model change, or chunking modification. Alert when metrics drop below baseline. Also monitor user signals: increased "thumbs down" or "not helpful" feedback often indicates retrieval degradation.
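The baseline comparison described above can be sketched as a small check that runs after each index or model change. The metric names, the numbers, and the 0.02 tolerance are hypothetical; a real suite would compute `current` from the regression queries.

```python
from typing import Dict, List

def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return names of metrics that fell more than `tolerance` below baseline."""
    regressions = []
    for name, base_value in baseline.items():
        value = current.get(name)
        # A metric that is missing from the new run is treated as a regression.
        if value is None or value < base_value - tolerance:
            regressions.append(name)
    return sorted(regressions)

baseline = {"recall@10": 0.90, "ndcg@10": 0.72, "mrr": 0.65}
current = {"recall@10": 0.84, "ndcg@10": 0.715, "mrr": 0.66}

failed = find_regressions(baseline, current)
if failed:
    print("Retrieval regression in:", ", ".join(failed))
```

The tolerance exists because small metric movements are expected between runs; alerting only on drops beyond it keeps the suite from firing on noise.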
