Chapter 9 · 10 Topics

Production RAG Architectures & Grounded Answering

Moving beyond naive retrieve-and-generate demos to build RAG systems that are grounded, attributable, and production-ready.

A naive RAG demo retrieves a few chunks and inserts them into a prompt. A production RAG system manages much more: permissions, freshness, citation quality, caching, evaluation, failure handling, and escalation rules. This chapter covers the full architecture discipline — from retrieval and reranking through grounding controls and observability — so your RAG answers are not just fluent, but trustworthy.

Retrieval Foundations

Understanding the spectrum from toy demos to robust retrieval pipelines, including multi-step reasoning and hallucination control.

1

Naive RAG vs Production RAG

Naive RAG proves possibility; production RAG proves reliability. The gap between the two is where most real engineering work lives — ranking, permissions, freshness, citations, error recovery, observability, and feedback loops.
🧠 Mental model: Think of naive RAG as a library where you grab the first book off the shelf and read aloud. Production RAG is an entire reference-desk service: the librarian checks your credentials, picks the best sources, verifies publication dates, cites page numbers, and says "I don't know" when the shelves are empty.

The Spectrum of RAG Maturity

Naive RAG is the canonical demo: embed documents, do a nearest-neighbor search, stuff the top-k chunks into the prompt, and let the model answer. It works surprisingly well for prototypes but falls apart in production because it ignores ranking quality, document permissions, freshness, citation attribution, error handling, and feedback.

Production RAG treats each of those gaps as a first-class concern. The result is not a single prompt trick but a full application architecture with observable, testable, and recoverable behavior.

Production RAG Scorecard

Layer | Example Check | Why It Matters
Retrieval | Relevant docs appear and rank well | Without evidence recall, the answer starts from a weak base
Grounding | Response cites supporting passages correctly | Prevents fluent unsupported claims
Fallback | System abstains when evidence is missing | Safer than forcing confident guesses
Operations | Latency and freshness stay inside target | Grounded systems still need product-grade reliability

Key Architectural Differences

  • Ranking: Production systems rerank with multi-signal scoring (relevance, trust, freshness) rather than relying on raw embedding similarity alone.
  • Fallback & abstention: A production pipeline must know when not to answer, rather than hallucinating confidently.
  • Observability: Every stage — retrieval, reranking, prompt assembly, generation — should emit metrics and traces for debugging.
  • Feedback loops: User corrections, thumbs-up/down, and escalation data feed back into retrieval tuning and prompt iteration.
Interview signal: Explain that naive RAG tests the idea while production RAG tests reliability, then list the specific controls (permissions, freshness, citation, fallback, evaluation) that bridge the gap.
Key takeaway: Production RAG is a grounded-answering pipeline with evidence selection, quality controls, and refusal behavior — not just a retrieval step attached to a model.
Python — Multi-signal reranking pattern
# Production reranking combines multiple signals beyond raw similarity.
# Weights are tunable per domain; freshness matters more for news,
# trust matters more for compliance docs.

def rank_candidate(candidate):
    # Semantic similarity from the embedding model (0-1)
    relevance = candidate["semantic_score"]
    # Trust score based on source authority (0-1)
    trust = candidate["source_trust"]
    # Freshness decay: newer docs score higher (0-1)
    freshness = candidate["freshness_score"]
    # Weighted combination - tune these for your domain
    return 0.65 * relevance + 0.20 * trust + 0.15 * freshness

# Sort candidates by composite score, best first
ranked = sorted(candidates, key=rank_candidate, reverse=True)

# Only pass top-k candidates that clear a minimum threshold
MIN_SCORE = 0.45
evidence = [c for c in ranked[:5] if rank_candidate(c) >= MIN_SCORE]
Follow-up Questions
How do you decide the right reranking weights?
Start with heuristics based on your domain (e.g., compliance needs higher trust weight), then tune on labeled evaluation sets. A/B testing on live traffic provides the final calibration. Log each signal independently so you can analyze which weights drive task success.
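One way to make "tune on labeled evaluation sets" concrete is a small grid search over the three weights. The sketch below is a minimal illustration, not a recommended tuner: the labeled-set shape (`candidates`, `doc_id`, `relevant_doc_id`), the top-1 accuracy metric, and the helper names are all assumptions introduced here.

```python
from itertools import product

def composite(candidate, weights):
    # Same three signals as the reranking pattern, with explicit weights
    w_rel, w_trust, w_fresh = weights
    return (w_rel * candidate["semantic_score"]
            + w_trust * candidate["source_trust"]
            + w_fresh * candidate["freshness_score"])

def top1_accuracy(labeled_queries, weights):
    # Fraction of queries whose top-ranked candidate is the labeled relevant doc
    if not labeled_queries:
        return 0.0
    hits = 0
    for q in labeled_queries:
        best = max(q["candidates"], key=lambda c: composite(c, weights))
        hits += best["doc_id"] == q["relevant_doc_id"]
    return hits / len(labeled_queries)

def grid_search_weights(labeled_queries, step=0.05):
    # Enumerate non-negative weight triples that sum to 1 on a coarse grid
    n = round(1 / step)
    best_weights, best_acc = None, -1.0
    for i, j in product(range(n + 1), repeat=2):
        if i + j > n:
            continue
        weights = (i * step, j * step, (n - i - j) * step)
        acc = top1_accuracy(labeled_queries, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc

# Hypothetical labeled example: the trusted doc is the right answer
labeled = [{
    "relevant_doc_id": "B",
    "candidates": [
        {"doc_id": "A", "semantic_score": 0.9, "source_trust": 0.1, "freshness_score": 0.5},
        {"doc_id": "B", "semantic_score": 0.8, "source_trust": 0.9, "freshness_score": 0.5},
    ],
}]
best_weights, best_acc = grid_search_weights(labeled)
```

A grid this coarse only narrows the search; live A/B testing still decides the final weights, and a real evaluation set needs far more than one query.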
What is the biggest risk of naive RAG in production?
The model generates fluent, confident answers from irrelevant or outdated context. Without quality controls, users trust the output because it sounds good, even when the evidence does not support the claim. This is especially dangerous in regulated domains.
How does observability differ between naive and production RAG?
Naive RAG typically has no instrumentation. Production RAG emits per-stage metrics: retrieval latency, number of candidates, rerank score distributions, prompt token count, generation latency, citation match rate, and abstention rate. These traces let teams pinpoint whether a failure is a retrieval problem, a ranking problem, or a generation problem.
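Per-stage metrics can be collected with a small context manager that wraps each pipeline stage. This is a sketch under assumptions: `StageTrace` and its record fields are hypothetical names, and a real deployment would export these records to a metrics or tracing backend instead of keeping them in a list.

```python
import time
from contextlib import contextmanager

class StageTrace:
    """Collects one record per pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.records = []

    @contextmanager
    def stage(self, name, **attrs):
        record = {"stage": name, **attrs}
        start = time.perf_counter()
        try:
            # The caller can attach stage-specific metrics to the record
            yield record
            record["status"] = "ok"
        except Exception as exc:
            record["status"] = "error"
            record["error"] = type(exc).__name__
            raise
        finally:
            # Latency is recorded whether the stage succeeded or failed
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self.records.append(record)

trace = StageTrace()
with trace.stage("retrieval") as rec:
    docs = ["doc-1", "doc-2"]  # stand-in for a real retriever call
    rec["num_candidates"] = len(docs)
```

Because every stage writes its own record, a failed request shows which stage raised and how long each earlier stage took.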
2

Single-Hop & Multi-Hop Retrieval

Single-hop retrieval answers one information need in one pass. Multi-hop retrieval connects multiple facts across documents iteratively — improving coverage but raising orchestration complexity and error compounding.
🧠 Mental model: Single-hop is asking one librarian one question. Multi-hop is a research project: you ask librarian A for a name, take that name to librarian B for context, then combine both answers into a coherent report.

When Single-Hop Falls Short

Many real questions require connecting information across documents. "What was the revenue of the company that acquired Startup X in 2024?" requires first identifying the acquirer, then looking up its financials. Single-hop retrieval will likely return documents about Startup X but miss the acquiring company's revenue figures.

Multi-Hop Retrieval Patterns

  • Query decomposition: Break the original question into sub-questions, retrieve evidence for each, and synthesize. This is the most common pattern.
  • Iterative retrieval: Use intermediate answers to formulate follow-up queries. Each hop refines or extends the evidence set.
  • Entity-chain retrieval: Extract entities from first-pass results and use them as queries for subsequent passes.

Trade-offs

Dimension | Single-Hop | Multi-Hop
Latency | Fast: one retrieval round | Slower: multiple sequential retrievals
Coverage | Limited to directly relevant docs | Can bridge facts across documents
Error risk | One point of failure | Errors compound across hops
Complexity | Simple pipeline | Requires orchestration and planning
Interview signal: More complex questions require planning, decomposition, and iterative retrieval rather than one nearest-neighbor lookup. But not every question needs multi-hop — show restraint by matching complexity to the actual task.
Key takeaway: Multi-hop retrieval improves coverage for complex questions but introduces orchestration overhead and error compounding. Use it when the question genuinely requires connecting multiple facts, not as a default.
Python — Simple multi-hop retrieval loop
# Multi-hop retrieval: decompose, retrieve, refine, synthesize.
# Each hop uses previous results to formulate better queries.

def multi_hop_retrieve(question, retriever, llm, max_hops=3):
    # Step 1: Ask the LLM to decompose the question
    sub_questions = llm.decompose(question)
    # Cap the plan up front and iterate over this list directly, so that
    # refined queries are actually used on the next hop
    hops = sub_questions[:max_hops]
    all_evidence = []

    for i, sq in enumerate(hops):
        # Step 2: Retrieve evidence for each sub-question
        docs = retriever.search(sq, top_k=3)
        all_evidence.extend(docs)

        # Step 3: Optionally refine next query using current evidence
        if i < len(hops) - 1:
            # Inject current findings into the next sub-question
            context_summary = llm.summarize(docs)
            hops[i + 1] = llm.refine_query(
                hops[i + 1], context_summary
            )

    # Step 4: Deduplicate and synthesize final answer
    unique_evidence = deduplicate(all_evidence)
    return llm.synthesize(question, unique_evidence)
Follow-up Questions
How do you prevent error compounding across hops?
Use confidence thresholds at each hop: if intermediate evidence is below a quality bar, abort or fall back to a simpler strategy. Validate extracted entities before using them as queries. Log intermediate results so you can trace which hop introduced noise.
How do you decide between single-hop and multi-hop at query time?
A query classifier or lightweight LLM call can predict whether the question requires decomposition. Simple factoid questions route to single-hop; comparison, reasoning, or temporal questions route to multi-hop. This avoids paying the latency cost for simple queries.
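The routing decision can be prototyped with a rule-based stand-in before investing in a trained classifier or an LLM call. The patterns below are illustrative assumptions, not a validated rule set; `route_query` and `MULTI_HOP_PATTERNS` are hypothetical names.

```python
import re

# Crude surface cues that a question may need more than one retrieval pass.
# These are illustrative; a production router would be trained or LLM-based.
MULTI_HOP_PATTERNS = [
    # Comparison questions usually need evidence about each side
    re.compile(r"\b(compare|versus|vs\.?|difference between)\b", re.I),
    # Nested entity references, e.g. "revenue of the company that acquired X"
    re.compile(r"\bof the \w+ (that|which|who)\b", re.I),
]

def route_query(question):
    # Return "multi_hop" when any cue matches, otherwise "single_hop"
    for pattern in MULTI_HOP_PATTERNS:
        if pattern.search(question):
            return "multi_hop"
    return "single_hop"
```

Logging the router's decision next to the final answer quality makes it possible to check later whether misrouted questions are a meaningful source of failures.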
How does multi-hop retrieval relate to agentic RAG?
Multi-hop is one capability of agentic RAG. Agentic RAG (see Topic 7: Agentic RAG) goes further by adding tool selection, dynamic planning, and self-evaluation. Multi-hop is the retrieval pattern; agentic RAG is the orchestration paradigm.
3

Reducing Hallucinations in RAG

Hallucination in RAG is often a retrieval and prompting problem before it is a decoding problem. If the context is wrong, thin, stale, or noisy, the generator will sound confident anyway. Fix the evidence first.
🧠 Mental model: Asking a model to "hallucinate less" is like asking a student to write a better essay without giving them better source material. The fix is upstream: better sources, stricter citation rules, and permission to say "I don't know."

The Hallucination Reduction Stack

There is no single switch to eliminate hallucination. Instead, production systems layer multiple defenses:

  1. Improve retrieval recall: If relevant documents are not retrieved, the model has no grounding material. Better embeddings, hybrid search (dense + sparse), and query expansion help.
  2. Rerank aggressively: Push the most trustworthy, relevant evidence to the top. Filter out noisy or tangential results (see Topic 1: Naive vs Production RAG).
  3. Constrain to cited evidence: Instruct the model to only make claims supported by the provided context, and to cite specific passages.
  4. Require abstention: When evidence is weak or missing, the system should refuse to answer rather than guess. This is a policy decision, not a model capability.
  5. Separate grounded vs ungrounded generation: Clearly distinguish factual claims (which must be cited) from general reasoning or hedging language.

Where Hallucination Comes From

Source | Symptom | Fix
Poor retrieval | Irrelevant context, model fills gaps from weights | Better embeddings, hybrid search, query rewriting
Thin context | Not enough evidence, model over-generalizes | Retrieve more chunks, use multi-hop when needed
Stale data | Outdated facts presented confidently | Freshness scoring, index refresh (see Topic 5)
Noisy context | Contradictory or irrelevant passages confuse generation | Aggressive reranking, deduplication, filtering
Weak prompting | No citation requirement, no abstention instruction | Explicit grounding instructions, structured output
Interview signal: Frame hallucination as a systems problem, not a model problem. The strongest answers list concrete interventions across retrieval, ranking, prompting, and policy layers.
Key takeaway: Do not ask the model to be "less hallucinatory" in the abstract. Instead, improve retrieval recall, rerank aggressively, constrain answers to cited evidence, and require abstention when support is weak.
Python — Grounding enforcement with abstention
# A simple grounding check: if the generated answer cannot be
# traced back to retrieved evidence, flag it for abstention.

def grounded_generate(question, evidence, llm):
    # Instruct the model to cite evidence and abstain if unsupported
    system_prompt = (
        "Answer the question using ONLY the provided evidence. "
        "Cite evidence by [doc_id]. If the evidence does not support "
        "a confident answer, respond with: 'I don't have enough "
        "information to answer this reliably.'"
    )

    response = llm.generate(
        system=system_prompt,
        user=f"Question: {question}\n\nEvidence:\n{format_evidence(evidence)}"
    )

    # Post-generation check: does the answer cite at least one source?
    if not has_citations(response):
        # No citations found - likely ungrounded, trigger fallback
        return {"answer": None, "status": "abstained", "reason": "no_citations"}

    return {"answer": response, "status": "grounded"}
Follow-up Questions
Can you completely eliminate hallucination in RAG?
No. Language models are probabilistic generators, and even with perfect evidence, they may paraphrase in ways that subtly distort meaning. The goal is to reduce, detect, and contain hallucination, not to claim it has been eliminated. Abstention and citation auditing are the strongest practical defenses.
How do you measure hallucination rate in production?
Use automated groundedness checks (NLI models or LLM-as-judge) to verify claims against retrieved evidence. Combine with human annotation on a sample. Track the rate of answers that contain unsupported claims over time. This becomes a key operational metric alongside retrieval recall and user satisfaction.
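The tracking logic can be sketched independently of the judge that does the checking. In the sketch below, `overlap_judge` is a deliberately crude token-overlap stand-in for an NLI model or LLM-as-judge, and the sample shape (`claims`, `evidence`) is an assumption made for illustration.

```python
import re

def _tokens(text):
    # Lowercased alphanumeric tokens
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_judge(claim, evidence_passages, threshold=0.6):
    # Stand-in judge: a claim counts as supported when enough of its tokens
    # appear in at least one passage. Swap in an NLI model or LLM judge.
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return True
    return any(
        len(claim_tokens & _tokens(passage)) / len(claim_tokens) >= threshold
        for passage in evidence_passages
    )

def unsupported_rate(samples, judge=overlap_judge):
    # Share of sampled answers that contain at least one unsupported claim
    if not samples:
        return 0.0
    flagged = sum(
        1 for s in samples
        if any(not judge(claim, s["evidence"]) for claim in s["claims"])
    )
    return flagged / len(samples)

samples = [
    {"claims": ["The refund window is 30 days"],
     "evidence": ["Our refund window is 30 days from purchase."]},
    {"claims": ["Shipping is free worldwide"],
     "evidence": ["Our refund window is 30 days from purchase."]},
]
rate = unsupported_rate(samples)
```

Keeping the judge pluggable lets the same metric be computed with a cheap heuristic on all traffic and a stronger judge on a sample.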
Grounding & Trust

How citations, freshness policies, and access control turn retrieval outputs into answers that users and auditors can trust.

4

Citations & Provenance

Citations make the answer inspectable. They let users and auditors verify where a claim came from and whether the supporting source actually says what the answer claims. Provenance is a control mechanism, not just a UX feature.
🧠 Mental model: Citations are the "show your work" requirement from math class applied to AI. Without them, a correct answer is indistinguishable from a plausible fabrication. Provenance turns trust from "I believe the model" into "I can check the source."

Why Provenance Is a Control Mechanism

In enterprise, legal, medical, and compliance-heavy environments, a fluent answer without attribution is a liability. Citations serve multiple roles:

  • User trust: Readers can click through and verify claims against the original source.
  • Auditability: Compliance teams can review what evidence the system relied on for a given answer.
  • Debugging: When an answer is wrong, citations show whether the fault lies in retrieval (wrong source), generation (misinterpreted source), or both.
  • Feedback signal: Citation click-through rates indicate whether users find the sources useful.

Citation Implementation Patterns

Pattern | How It Works | Strength
Inline references | Model emits [1], [2] tags mapped to source list | Familiar to users, easy to verify
Passage-level grounding | Each claim links to the exact passage that supports it | Fine-grained auditability
Post-hoc attribution | After generation, a separate model maps claims to evidence | Works with models that do not cite natively
Structured output | Response schema includes claims + source fields | Machine-parseable, easy to validate
Interview signal: Provenance is not a nice-to-have — it is a control mechanism that increases trust, simplifies debugging, and makes human review faster because the evidence trail is visible.
Key takeaway: Citations transform RAG from a black-box answer machine into an inspectable system where every claim can be traced to its source.
Python — Structured citation output
# Generate an answer with structured citations.
# Each claim maps to the evidence passage that supports it.

import json

def generate_cited_answer(question, evidence_docs, llm):
    # Build a schema-enforced prompt for citation
    schema_instruction = """Respond in JSON:
{
  "answer": "your full answer text with [1], [2] markers",
  "citations": [
    {"id": 1, "doc_id": "...", "passage": "exact quote", "claim": "what it supports"}
  ],
  "confidence": "high | medium | low",
  "unsupported_claims": ["any claims you could not ground"]
}"""

    # Format evidence with doc IDs for reference
    evidence_text = "\n".join(
        f"[Doc {d['id']}]: {d['text']}"
        for d in evidence_docs
    )

    raw = llm.generate(
        system=schema_instruction,
        user=f"Question: {question}\n\nEvidence:\n{evidence_text}"
    )

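    # Parse and validate the structured response
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed output cannot be audited, so treat it as ungrounded
        return {"answer": None, "citations": [], "confidence": "low",
                "unsupported_claims": [], "error": "invalid_json"}
    # Flag answers where no citations were produced
    if not result.get("citations"):
        result["confidence"] = "low"
    return result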
Follow-up Questions
How do you verify that citations are accurate, not just present?
Use a natural language inference (NLI) model to check whether the cited passage actually entails the claim. Alternatively, use an LLM-as-judge to evaluate faithfulness. Presence of a citation number is necessary but not sufficient — the citation must semantically support the claim.
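Before running a semantic check, a cheap structural audit can catch citations that point at unknown documents or quote text the document does not contain. The sketch below assumes the citation schema from the structured-output example (`doc_id`, `passage`) and evidence dicts with `id` and `text`; `audit_citations` is a hypothetical helper, and it does not replace the NLI or LLM-as-judge step.

```python
def _normalize(text):
    # Lowercase and collapse whitespace so minor formatting differences
    # do not cause false mismatches
    return " ".join(text.lower().split())

def audit_citations(result, evidence_docs):
    # Returns a list of problems; an empty list means every citation
    # quotes text that really appears in the document it names
    docs_by_id = {str(d["id"]): _normalize(d["text"]) for d in evidence_docs}
    problems = []
    for citation in result.get("citations", []):
        doc_text = docs_by_id.get(str(citation.get("doc_id")))
        passage = _normalize(citation.get("passage", ""))
        if doc_text is None:
            problems.append({"id": citation.get("id"), "issue": "unknown_doc"})
        elif not passage or passage not in doc_text:
            problems.append({"id": citation.get("id"), "issue": "quote_not_found"})
    return problems
```

A citation that passes this audit is only verified as a real quote; whether the quote supports the claim still needs the semantic check.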
What happens when the same fact appears in multiple sources?
Prefer the most authoritative and most recent source. If sources conflict, surface the conflict to the user rather than silently picking one. Deduplication logic should preserve provenance metadata so the citation trail remains accurate.
5

Freshness & Knowledge Updates

Freshness should be handled in the data layer, not by hoping the base model knows recent facts. A production system needs ingestion schedules, document versioning, deletion policies, and index refresh procedures so retrieval reflects the current source of truth.
🧠 Mental model: A RAG index is like a newspaper archive. If you never add today's paper, the system can only answer with yesterday's news. Freshness policies are your subscription service: they decide how quickly new editions arrive and when old ones get retired.

The Freshness Problem

Base language models have a training knowledge cutoff. RAG was partly invented to bridge this gap — but the bridge only works if the document index is kept current. A stale index recreates the same problem RAG was meant to solve.

Freshness Controls

  • Ingestion schedules: Define how frequently new or updated documents are processed and indexed. Real-time for critical sources, batch for stable references.
  • Document versioning: Track which version of a document is in the index. When a document is updated, the old version should be replaced or marked superseded.
  • Deletion policies: Remove or tombstone documents that are no longer valid. A retracted policy document should not appear as evidence.
  • Freshness scoring: Weight more recent documents higher in ranking (see Topic 1: Naive vs Production RAG) so the system naturally prefers current information.
  • Uncertainty communication: If the system cannot guarantee freshness for a request, it should communicate uncertainty or route to a more reliable source instead of inventing confidence.
Stale-answer risk: A confident answer from outdated evidence can be worse than no answer at all. Freshness is not optional — it is a correctness requirement for any domain where facts change.
Key takeaway: Freshness is a data-layer responsibility. If the system cannot guarantee its evidence is current, it should communicate that uncertainty rather than confidently citing outdated sources.
Python — Freshness-aware retrieval filter
# Filter and score documents by freshness before ranking.
# Ensures stale evidence does not dominate the context window.

from datetime import datetime

def freshness_score(doc, max_age_days=90):
    # Calculate how fresh the document is (0 = expired, 1 = brand new)
    # Assumes doc["last_updated"] is a naive datetime, comparable to datetime.now()
    age = max((datetime.now() - doc["last_updated"]).days, 0)  # Clamp future timestamps
    if age > max_age_days:
        return 0.0  # Beyond max age, treat as stale
    return 1.0 - (age / max_age_days)

def filter_stale(docs, min_freshness=0.1):
    # Remove documents below the freshness threshold
    fresh = [d for d in docs if freshness_score(d) >= min_freshness]
    if not fresh:
        # All docs are stale - signal uncertainty upstream
        return [], "all_evidence_stale"
    return fresh, "ok"
Follow-up Questions
How do you handle documents that are old but still authoritative?
Use document-type metadata to distinguish evergreen content (legal statutes, scientific constants) from time-sensitive content (news, quarterly reports). Freshness decay curves should be configurable per document type — a physics textbook does not decay the same way a press release does.
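Configurable decay per document type can be sketched as a half-life table. The document types, the half-life values, and the `doc_type` field are illustrative assumptions, and this version uses exponential decay with timezone-aware timestamps instead of a linear cutoff.

```python
from datetime import datetime, timezone

# Half-life in days per document type; None marks evergreen content.
# These values are placeholders to be tuned per domain.
HALF_LIFE_DAYS = {
    "news": 7,
    "quarterly_report": 90,
    "policy": 365,
    "evergreen": None,
}
DEFAULT_HALF_LIFE_DAYS = 90

def typed_freshness(doc, now=None):
    # Score in (0, 1]: halves every half-life, never decays for evergreen docs.
    # Assumes doc["last_updated"] is a timezone-aware datetime.
    now = now or datetime.now(timezone.utc)
    half_life = HALF_LIFE_DAYS.get(doc.get("doc_type"), DEFAULT_HALF_LIFE_DAYS)
    if half_life is None:
        return 1.0
    age_days = max((now - doc["last_updated"]).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life)
```

With these placeholder values, a week-old press release scores 0.5 while a decades-old statute tagged as evergreen still scores 1.0.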
What is the cost of real-time index updates vs batch?
Real-time ingestion (streaming changes into the index) gives the freshest data but costs more in compute and complexity. Batch ingestion (nightly or hourly refreshes) is simpler but introduces a freshness lag. Most production systems use a tiered approach: real-time for critical sources, batch for the long tail.
6

Permissions & Access Control

A RAG system should never retrieve documents the current user is not allowed to see. Access control must be enforced before or during retrieval, not only at final display time — otherwise the model may leak restricted content through generation.
🧠 Mental model: Access control in RAG is like a security door on the library stacks, not a blackout marker on the photocopy. If the model has already read the restricted document, redacting the output is too late — traces leak through paraphrasing and reasoning.

Why Display-Time Filtering Is Insufficient

If access control is only applied after generation, the model has already processed restricted content. It may paraphrase confidential information, use restricted facts in its reasoning chain, or subtly reference protected data. The damage is done at retrieval time, not display time.

Enforcement Patterns

  • Permission-aware indexing: Tag every document with ACLs (access control lists) at ingestion time. Retrieval queries include user permission metadata as a filter.
  • Pre-retrieval filtering: Before the similarity search runs, apply metadata filters that exclude documents the user cannot access.
  • Tenant isolation: In multi-tenant systems, maintain separate indices or strict partition keys per tenant so that cross-tenant leakage is prevented by the storage layout rather than left to per-query logic.
  • Prompt instructions as defense-in-depth: Telling the model "do not reveal secrets" is a backup, not a primary control. It fails under prompt injection and adversarial queries.

Common Mistakes

Mistake | Risk | Correction
Filter at display only | Model already saw restricted content | Filter at retrieval time
Rely on prompt instructions | Bypassable via prompt injection | Use architectural controls
Shared cache across users | User A sees User B's results | Permission-scoped caching
Stale ACL metadata | Revoked access still works | Sync ACLs on ingestion refresh
Interview signal: Security lives in the retrieval layer. Permission-aware indexing, metadata filtering, and tenant isolation are as important as prompt instructions that say "do not reveal secrets."
Key takeaway: Enforce access control at retrieval time, not display time. If the model has already seen restricted data, no amount of output filtering can guarantee it will not leak through generation.
Python — Permission-scoped retrieval
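# Enforce document permissions at retrieval time.
# Never let the model see documents the user cannot access.
# Note: the "$in" filter syntax and the similarity_search signature vary
# by vector store; adapt both to the store you use.

import logging

logger = logging.getLogger(__name__)
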
def permission_scoped_search(query, user, vector_store):
    # Get the user's permission groups
    user_groups = user["access_groups"]  # e.g., ["engineering", "public"]

    # Build a metadata filter that restricts to allowed documents
    acl_filter = {
        "access_groups": {"$in": user_groups}
    }

    # The vector search only sees documents matching the ACL filter
    results = vector_store.similarity_search(
        query=query,
        k=10,
        filter=acl_filter  # Pre-retrieval: restricted docs never enter the results
    )

    # Double-check: log any result without ACL metadata as anomalous
    for r in results:
        if not r.metadata.get("access_groups"):
            logger.warning(f"Doc {r.id} missing ACL metadata")

    return results
Follow-up Questions
How do caching layers interact with access control?
Caches must be permission-scoped. A cached answer generated from documents User A can see must not be served to User B if User B lacks those permissions. Cache keys should include the user's permission scope. See Topic 8: Caching Layers for more.
What about documents with mixed sensitivity within a single file?
This requires chunk-level ACLs rather than document-level. If a document contains both public and restricted sections, chunking should preserve section-level access metadata. This is more complex but necessary for documents like redacted legal filings or partially classified reports.
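Section-level access metadata can be carried onto each chunk at ingestion time. The document shape (`sections`, per-section `access_groups`) and both helper names are assumptions for illustration; the sketch fails closed, so a chunk with no ACL at either level is visible to nobody.

```python
def chunk_with_acl(document):
    # One chunk per section; a section without its own ACL inherits the
    # document-level ACL, and a missing ACL at both levels means no access
    chunks = []
    for index, section in enumerate(document["sections"]):
        groups = section.get("access_groups", document.get("access_groups", []))
        chunks.append({
            "doc_id": document["id"],
            "chunk_index": index,
            "text": section["text"],
            "access_groups": list(groups),
        })
    return chunks

def visible_chunks(chunks, user_groups):
    # Keep only chunks that share at least one group with the user
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c["access_groups"])]
```

In a real system the visibility check would run as a metadata filter inside the vector store query, as in the permission-scoped search, instead of as a post-filter in application code.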
Advanced Patterns

More sophisticated retrieval and optimization strategies for complex tasks and high-traffic production systems.

7

Agentic RAG

Agentic RAG extends retrieval beyond one fixed lookup. The system may rewrite queries, choose among tools, perform multiple retrieval steps, inspect intermediate results, and decide whether more evidence is needed before answering.
🧠 Mental model: Standard RAG is a vending machine: insert query, receive answer. Agentic RAG is a research assistant: it reads the question, decides which databases to check, evaluates the results, and goes back for more information if the first pass was insufficient.

What Makes RAG "Agentic"

The key distinction is autonomy in retrieval strategy. A standard RAG pipeline has a fixed sequence: embed query, search index, rerank, generate. An agentic pipeline adds decision points where the model can:

  • Rewrite or decompose the query before retrieval
  • Choose among multiple retrieval tools (vector search, keyword search, SQL query, API call)
  • Evaluate intermediate results and decide whether to retrieve more
  • Route to different generation strategies based on evidence quality

When Agentic RAG Is Worth the Complexity

Use Case | Why Agentic Helps | Standard RAG Limitation
Multi-step questions | Decomposes and retrieves iteratively | Single-hop misses linked facts
Heterogeneous sources | Chooses the right tool per sub-task | Only queries one index type
Ambiguous queries | Rewrites for clarity before retrieval | Retrieves with the original vague query
Quality-sensitive domains | Self-evaluates and retries on weak evidence | Returns whatever it finds first

When to Show Restraint

Not every retrieval workflow needs agent behavior. Agentic RAG adds latency, complexity, and failure paths. For simple factoid lookups, it is engineering overkill. A strong interview answer shows that you know when to reach for it rather than defaulting to it.

Interview signal: Agentic RAG is powerful when the question requires decomposition or tool use, but it can add latency and failure paths for simple tasks. Show you know the trade-off.
Key takeaway: Agentic RAG gives the model autonomy to plan, select tools, and iterate on retrieval. Use it when questions are genuinely complex, not as a default for simple lookups.
Python — Agentic retrieval loop with tool selection
# Agentic RAG: the model decides which tools to use and
# whether evidence is sufficient before generating.

def agentic_rag(question, tools, llm, max_steps=5):
    # Available tools: vector_search, sql_query, api_lookup, etc.
    context = []
    plan = llm.plan(question, available_tools=tools)

    for step in plan.steps[:max_steps]:
        # The model chose which tool and query to use
        tool = tools[step.tool_name]
        result = tool.execute(step.query)
        context.append({"tool": step.tool_name, "result": result})

        # After each step, ask: do we have enough evidence?
        sufficiency = llm.evaluate_evidence(question, context)
        if sufficiency.is_sufficient:
            break  # Evidence is good enough, proceed to generation

    # Generate the final answer using all collected evidence
    return llm.generate(question, context)
Follow-up Questions
How do you prevent agentic loops from running forever?
Set a maximum step count and a time budget. If either limit is hit before the evidence is judged sufficient, force a best-effort answer with an uncertainty flag. Log loop traces so you can identify questions that consistently exhaust the budget.
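A loop bounded by both a step count and a wall-clock budget can be sketched as follows. `run_with_budget`, its status strings, and the injectable `clock` parameter are hypothetical; the steps are plain callables standing in for tool executions.

```python
import time

def run_with_budget(steps, is_sufficient, max_steps=5,
                    time_budget_s=10.0, clock=time.monotonic):
    # Runs steps in order and returns (context, status). Any status other
    # than "sufficient" means the caller should flag the answer as uncertain.
    start = clock()
    context = []
    for i, step in enumerate(steps):
        if i >= max_steps:
            return context, "step_limit"
        if clock() - start > time_budget_s:
            return context, "time_budget"
        context.append(step())
        if is_sufficient(context):
            return context, "sufficient"
    return context, "plan_exhausted"
```

The time check runs between steps, so one slow tool call can still overshoot the budget; a per-call timeout is needed if that matters.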
How does agentic RAG relate to multi-hop retrieval?
Multi-hop retrieval (see Topic 2: Single-Hop & Multi-Hop Retrieval) is one capability within agentic RAG. Agentic RAG is the broader paradigm that also includes tool selection, query rewriting, and self-evaluation. You can do multi-hop without full agent behavior, but agentic RAG typically involves multi-hop.
What are the failure modes unique to agentic RAG?
Key risks include tool selection errors (model picks the wrong tool), query drift (reformulated queries diverge from the original intent), evidence pollution (later hops retrieve noise that dilutes good early evidence), and latency blowup (too many retrieval steps for the user's patience). Each needs monitoring and guardrails.
8

Caching Layers

Caching reduces latency and cost by reusing expensive results — embeddings, retrieval outputs, reranked sets, or final answers — for repeated or near-duplicate queries. But caching must respect freshness and permissions, or it becomes a liability.
🧠 Mental model: Caching in RAG is like a fast-food counter with pre-made sandwiches. They are quick and cheap to serve — but only if the ingredients have not expired and the customer does not have an allergy. A fast cached answer that is stale or leaks across user scopes is worse than a slower fresh one.

What to Cache in RAG

Cache Layer | What Is Cached | Benefit | Invalidation Challenge
Embedding cache | Query → embedding vector | Avoids re-computing embeddings | Low risk; embeddings are stable for a fixed model, so invalidate when the embedding model changes
Retrieval cache | Query → top-k documents | Skips vector search entirely | Must invalidate when index changes
Reranked set cache | Query → reranked candidates | Skips expensive reranking | Must respect freshness and signal changes
Answer cache | Query → final generated answer | Skips generation entirely | Highest risk: stale answers, permission leaks

Cache Safety Rules

  • Freshness TTLs: Every cache entry needs a time-to-live aligned with the source data's update frequency. A news corpus needs minute-level TTLs; a legal reference might tolerate daily.
  • Permission-scoped keys: Cache keys must include the user's permission scope. User A's cached answer must never be served to User B if they have different access levels (see Topic 6: Permissions & Access Control).
  • Semantic deduplication: Near-duplicate queries (e.g., "What is RAG?" vs "What does RAG mean?") can share cache entries if a similarity threshold is met, but this needs careful tuning.
Interview signal: Caching must respect freshness and permissions. A fast cached answer that is stale or leaked across user scopes is worse than a slower uncached answer.
Key takeaway: Cache aggressively for performance, but scope every cache entry by freshness TTL and user permissions. Speed without safety is a liability.
Python — Permission-scoped answer cache
# A permission-aware cache for RAG answers.
# Cache keys include user scope to prevent cross-user leaks.

import hashlib
import time

class ScopedAnswerCache:
    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl = ttl_seconds

    def _key(self, query, user_scope):
        # Include user permissions in the cache key
        scope_str = ",".join(sorted(user_scope))
        raw = f"{query}||{scope_str}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, query, user_scope):
        key = self._key(query, user_scope)
        entry = self.cache.get(key)
        if entry is None:
            return None  # Cache miss
        if (time.time() - entry["ts"]) < self.ttl:
            return entry["answer"]  # Cache hit, within TTL
        del self.cache[key]  # Expired: evict so stale entries do not accumulate
        return None

    def put(self, query, user_scope, answer):
        key = self._key(query, user_scope)
        self.cache[key] = {"answer": answer, "ts": time.time()}
Follow-up Questions
How do you handle cache invalidation when documents are updated?
Use event-driven invalidation: when a document is re-indexed, broadcast an invalidation event that clears all cache entries that referenced that document. This requires tracking which documents contributed to each cached answer — a form of provenance for the cache layer.
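The provenance tracking described above can be sketched as a cache that records which documents fed each entry. This is a minimal in-memory illustration under stated assumptions: the class name `ProvenanceCache`, its method names, and the sample keys and document ids are all hypothetical, and a real deployment would likely use a shared store and deliver the re-index event over a message bus.

```python
from collections import defaultdict


class ProvenanceCache:
    """Answer cache that remembers which documents fed each entry (illustrative)."""

    def __init__(self):
        self.entries = {}                     # cache key -> answer
        self.keys_by_doc = defaultdict(set)   # doc id -> cache keys that used it
        self.docs_by_key = {}                 # cache key -> doc ids it used

    def put(self, key, answer, source_doc_ids):
        self.invalidate_key(key)              # drop old provenance on overwrite
        self.entries[key] = answer
        self.docs_by_key[key] = set(source_doc_ids)
        for doc_id in source_doc_ids:
            self.keys_by_doc[doc_id].add(key)

    def get(self, key):
        return self.entries.get(key)

    def invalidate_key(self, key):
        self.entries.pop(key, None)
        for doc_id in self.docs_by_key.pop(key, set()):
            self.keys_by_doc[doc_id].discard(key)

    def on_document_reindexed(self, doc_id):
        """Handle an invalidation event: clear every entry that used doc_id."""
        stale_keys = list(self.keys_by_doc.get(doc_id, ()))
        for key in stale_keys:
            self.invalidate_key(key)
        return len(stale_keys)


cache = ProvenanceCache()
cache.put("q1", "Refunds take 5 days.", ["policy-7"])
cache.put("q2", "Shipping is free over $50.", ["policy-7", "faq-2"])
cache.put("q3", "Support hours are 9-5.", ["faq-9"])

# Re-indexing policy-7 clears q1 and q2 but leaves q3 untouched.
removed = cache.on_document_reindexed("policy-7")
```

The reverse index (`keys_by_doc`) is what makes invalidation cheap: the handler touches only the entries that cited the changed document instead of scanning the whole cache.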
Is semantic caching worth the complexity?
It depends on query distribution. If many users ask similar but not identical questions, semantic caching (matching queries by embedding similarity) can dramatically improve hit rates. But it introduces false-positive risk: two semantically similar queries might need different answers if context differs. Start with exact-match caching and graduate to semantic only if hit rates justify it.
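To make the exact-match-first, semantic-second progression concrete, here is a rough sketch. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, the permission scope is simplified to a single string, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold   # assumed value; tune on real traffic
        self.exact = {}              # (scope, query) -> answer
        self.vectors = []            # (scope, embedding, answer)

    def put(self, query, scope, answer):
        self.exact[(scope, query)] = answer
        self.vectors.append((scope, self.embed(query), answer))

    def get(self, query, scope):
        # Exact match first: cheap, and it carries no false-positive risk.
        if (scope, query) in self.exact:
            return self.exact[(scope, query)], "exact"
        # Otherwise look for the most similar cached query in the same scope.
        q_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for entry_scope, vec, answer in self.vectors:
            if entry_scope != scope:
                continue             # never match across permission scopes
            score = cosine(q_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_score >= self.threshold:
            return best_answer, "semantic"
        return None, "miss"


cache = SemanticCache()
cache.put("what is the refund policy", "employees", "Refunds take 5 days.")

exact_hit = cache.get("what is the refund policy", "employees")
near_hit = cache.get("What is the refund policy?", "employees")
other_scope = cache.get("What is the refund policy?", "contractors")
unrelated = cache.get("how do I reset my password", "employees")
```

Returning the match type alongside the answer lets you log exact and semantic hit rates separately, which is the data you need to decide whether the semantic tier is earning its complexity.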
Operations & Decision-Making

Evaluating RAG systems rigorously and knowing when RAG is not the right tool for the job.

9

Evaluation: Offline & Online

Offline evaluation checks retrieval relevance, groundedness, citation correctness, and answer quality on curated test sets. Online evaluation tracks live user satisfaction, task completion, and escalation behavior. Both are needed because offline wins do not always survive contact with real traffic.
🧠 Mental model: Offline evaluation is a dress rehearsal; online evaluation is opening night. The dress rehearsal catches obvious problems, but only the live audience reveals how the system performs under unpredictable conditions.

Offline Evaluation

Run the RAG pipeline against a curated test set with known-good answers and source documents. Measure:

  • Retrieval recall/precision: Are the right documents being retrieved? Are irrelevant ones excluded?
  • Groundedness: Does the generated answer only make claims supported by the retrieved evidence?
  • Citation correctness: Do citations point to passages that actually support the cited claim?
  • Answer quality: Overall accuracy, completeness, and usefulness as judged by human reviewers or LLM-as-judge.
  • Abstention correctness: Does the system refuse to answer when it should?
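The retrieval and abstention checks in the list above reduce to simple set and count arithmetic. The sketch below assumes document ids are comparable values and that each abstention test case carries two hypothetical boolean fields, `abstained` and `should_abstain`; neither the function names nor the field names come from a specific library.

```python
def retrieval_precision_recall(retrieved_ids, expected_ids):
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    hits = len(retrieved & expected)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


def abstention_accuracy(cases):
    """Fraction of cases where the system abstained exactly when it should."""
    if not cases:
        return 0.0
    correct = sum(1 for c in cases if c["abstained"] == c["should_abstain"])
    return correct / len(cases)


# Two of four retrieved docs are relevant; two of three relevant docs were found.
precision, recall = retrieval_precision_recall(
    retrieved_ids=["d1", "d2", "d3", "d4"],
    expected_ids=["d1", "d3", "d9"],
)

abstention = abstention_accuracy([
    {"abstained": True, "should_abstain": True},    # correct refusal
    {"abstained": False, "should_abstain": False},  # correct answer attempt
    {"abstained": False, "should_abstain": True},   # answered without evidence
    {"abstained": False, "should_abstain": False},  # correct answer attempt
])
```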

Online Evaluation

Observe the system in production with real users:

  • User satisfaction: Thumbs-up/down, star ratings, NPS-style surveys.
  • Task completion: Did the user accomplish what they came for?
  • Correction rate: How often do users rephrase and retry?
  • Escalation rate: How often do users abandon the AI and contact a human?
  • Citation click-through: Are users actually verifying sources?
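As a rough illustration, the online signals above can be aggregated from per-session logs. The session schema here (`thumbs`, `task_completed`, `rephrased`, `escalated`, `citation_clicked`) is hypothetical; real logging will differ, and task completion in particular usually needs a product-specific definition.

```python
def online_metrics(sessions):
    """Aggregate per-session logs into the online metrics listed above."""
    total = len(sessions)
    if total == 0:
        return {}
    # Satisfaction is computed only over sessions where the user left a rating.
    rated = [s for s in sessions if s.get("thumbs") is not None]
    satisfaction = (
        sum(1 for s in rated if s["thumbs"] == "up") / len(rated) if rated else None
    )
    return {
        "satisfaction": satisfaction,
        "task_completion": sum(s["task_completed"] for s in sessions) / total,
        "correction_rate": sum(s["rephrased"] for s in sessions) / total,
        "escalation_rate": sum(s["escalated"] for s in sessions) / total,
        "citation_ctr": sum(s["citation_clicked"] for s in sessions) / total,
    }


sessions = [
    {"thumbs": "up", "task_completed": True, "rephrased": False,
     "escalated": False, "citation_clicked": True},
    {"thumbs": "down", "task_completed": False, "rephrased": True,
     "escalated": True, "citation_clicked": False},
    {"thumbs": None, "task_completed": True, "rephrased": True,
     "escalated": False, "citation_clicked": False},
    {"thumbs": "up", "task_completed": True, "rephrased": False,
     "escalated": False, "citation_clicked": True},
]
metrics = online_metrics(sessions)
```

Tracking these per release makes it possible to check whether an offline improvement actually moved live behavior.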

Error Decomposition

Error Type | Symptom | Diagnostic
Retrieval error | Right answer exists but was not retrieved | Check recall metrics on known-good docs
Ranking error | Right doc retrieved but ranked too low | Inspect rank positions of ground-truth docs
Prompt error | Right docs in context but instructions unclear | A/B test prompt variations with same context
Generation error | Right docs, right prompt, but model hallucinates | Groundedness score comparison across models
Interview signal: Evaluation should separate retrieval errors, ranking errors, prompt errors, and generation errors. Otherwise the team cannot tell where to intervene.
Key takeaway: Both offline and online evaluation are required. The strongest evaluation frameworks decompose errors by source (retrieval, ranking, prompt, generation) so teams can intervene at the right layer.
Python — Offline evaluation harness
# A minimal offline evaluation harness for RAG.
# Measures retrieval recall, groundedness, and answer quality.

def evaluate_rag(test_cases, rag_pipeline, judge_llm):
    results = []
    for tc in test_cases:
        # Run the RAG pipeline on the test question
        output = rag_pipeline.run(tc["question"])

        # Metric 1: Did we retrieve the expected source documents?
        retrieved_ids = {d["id"] for d in output["retrieved_docs"]}
        expected_ids = set(tc["expected_doc_ids"])
        recall = len(retrieved_ids & expected_ids) / max(len(expected_ids), 1)

        # Metric 2: Is the answer grounded in the evidence?
        groundedness = judge_llm.score_groundedness(
            answer=output["answer"],
            evidence=output["retrieved_docs"]
        )

        # Metric 3: Overall answer quality vs reference
        quality = judge_llm.score_quality(
            answer=output["answer"],
            reference=tc["reference_answer"]
        )

        results.append({
            "question": tc["question"],
            "retrieval_recall": recall,
            "groundedness": groundedness,
            "quality": quality
        })
    return results
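The harness reports scores per test case, but the error decomposition in this topic also asks which layer failed. One possible triage sketch follows, applied to cases already judged wrong. The record fields (`ranked_doc_ids`, `expected_doc_ids`, `grounded`) and the `top_k` cutoff are assumptions, and the final split between prompt and generation errors is only a heuristic; confirming it still needs the A/B test or groundedness comparison.

```python
def classify_failure(case, top_k=5):
    """Attribute a failed test case to the earliest layer that went wrong.

    `case` is a hypothetical record with:
      ranked_doc_ids   - every candidate the retriever returned, best first
      expected_doc_ids - ground-truth supporting documents
      grounded         - True if the answer is supported by the context docs
    """
    expected = set(case["expected_doc_ids"])
    ranked = case["ranked_doc_ids"]

    if not expected & set(ranked):
        return "retrieval_error"    # right doc never retrieved
    if not expected & set(ranked[:top_k]):
        return "ranking_error"      # retrieved, but below the context cutoff
    if not case["grounded"]:
        return "generation_error"   # right docs in context, answer unsupported
    return "prompt_error"           # grounded yet wrong: suspect the instructions


failed_cases = [
    {"ranked_doc_ids": ["a", "b"], "expected_doc_ids": ["z"], "grounded": False},
    {"ranked_doc_ids": ["a", "b", "c", "d", "e", "f", "z"],
     "expected_doc_ids": ["z"], "grounded": False},
    {"ranked_doc_ids": ["z", "a"], "expected_doc_ids": ["z"], "grounded": False},
    {"ranked_doc_ids": ["z", "a"], "expected_doc_ids": ["z"], "grounded": True},
]
labels = [classify_failure(c) for c in failed_cases]
```

Checking layers in pipeline order matters: a missing document makes every downstream check meaningless, so the earliest failure is the one worth fixing first.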
Follow-up Questions
How large should the offline test set be?
Large enough to cover your key query types, edge cases, and failure modes. Start with 50-100 diverse, representative examples. As the system matures, grow the set by adding cases from real-user failures and regression bugs. Quality and diversity matter more than raw count.
How do you handle the "LLM-as-judge" reliability problem?
LLM judges are useful for scale but have known biases (verbosity preference, position bias). Calibrate by comparing LLM judgments against human annotations on a sample. Use structured rubrics with specific criteria rather than open-ended "rate this answer" prompts. Treat LLM-as-judge as a screening tool, not ground truth.
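The calibration step can be made concrete by measuring how often the judge agrees with human annotators on the same sample. The sketch below computes raw agreement and Cohen's kappa, which discounts the agreement expected by chance. The function name and the pass/fail labels are illustrative, not taken from any particular evaluation library.

```python
from collections import Counter


def judge_agreement(human_labels, judge_labels):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    human_counts, judge_counts = Counter(human_labels), Counter(judge_labels)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
agreement, kappa = judge_agreement(human, judge)
```

In this sample the raters agree on 6 of 8 items (0.75), but kappa is about 0.47, because two raters who each say "pass" most of the time would agree fairly often by chance alone.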
10

When Not to Use RAG

Do not use RAG when the task depends mostly on stable procedural logic, deterministic computations, or data best accessed through structured APIs. RAG is also a poor fit when documents are too noisy for reliable retrieval or when search plus templates solves the problem more simply.
🧠 Mental model: RAG is a power drill — excellent for the right job, but you do not use it to turn a screw that just needs a screwdriver. Knowing when not to use RAG signals the same engineering maturity as knowing how to build one well.

When RAG Is Not the Answer

Scenario | Why RAG Is a Poor Fit | Better Alternative
Deterministic computation | Math, logic, and data transformations do not need retrieval | Code execution, SQL, calculators
Structured data queries | Tabular data is better queried via SQL than chunked into vectors | Text-to-SQL, direct API access
Stable procedural tasks | Step-by-step workflows do not change per query | Hardcoded logic, decision trees, templates
Noisy document corpus | Low-quality sources yield unreliable retrieval | Curate data first, or use structured extraction
Simple search suffices | User just needs to find a document, not get an answer | Traditional search with snippets

The Adequacy Principle

The best solution is the one that is adequate, controllable, and maintainable. RAG adds retrieval infrastructure, embedding pipelines, index management, and generation complexity. If the task can be solved with a simpler architecture that meets the quality bar, prefer simplicity.

Signs You Are Over-Engineering

  • The retrieval step returns the same few documents for every query (the corpus is too small or narrow).
  • Users never read the AI-generated answer — they always click through to the source document.
  • The generation step adds no value beyond what the raw search snippet provides.
  • Maintaining the index and pipeline costs more engineering time than the feature saves.
Interview signal: Being able to say when you would not use RAG signals maturity. Strong engineers know when not to introduce a more complex architecture. The best solution is the one that is adequate, controllable, and maintainable.
Key takeaway: Knowing when not to use RAG is as important as knowing how to build it. Reach for RAG when you need grounded, attributable answers from unstructured knowledge — not as a default for every LLM application.
Python — Decision framework: RAG vs alternatives
# A simple decision function to route tasks to the right architecture.
# Not every LLM task needs RAG; use the simplest adequate approach.

def choose_architecture(task):
    # Check if the task is better served by deterministic tools
    if task.requires_computation:
        return "code_execution"  # Math, data transforms, logic

    if task.data_is_structured:
        return "text_to_sql"  # Tabular data, databases

    if task.workflow_is_static:
        return "template_engine"  # Fixed procedures, decision trees

    if task.corpus_too_noisy:
        return "curate_first"  # Clean data before building RAG

    if task.search_suffices:
        return "traditional_search"  # Snippets are enough

    # RAG is appropriate: unstructured knowledge, grounding needed
    if task.needs_grounded_answers and task.has_quality_corpus:
        return "rag"

    return "evaluate_further"  # Edge case - needs human judgment
Follow-up Questions
What if stakeholders insist on using RAG for a task where it is not a good fit?
Build a small proof-of-concept of each approach, evaluate both against the same metrics, and show quantitatively whether the simpler one meets the quality bar at lower cost and complexity. Data wins arguments. If RAG adds no measurable quality improvement, the maintenance burden is pure overhead.
Can you combine RAG with non-RAG approaches?
Absolutely. Many production systems use a router that sends deterministic queries to code execution, structured queries to SQL, and knowledge questions to RAG. This hybrid approach gives you the best tool for each sub-task without forcing everything through the retrieval pipeline.
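A hybrid router of this kind can be sketched as a classifier plus a dispatch table. The rules below are deliberately toy heuristics invented for illustration (a regex for arithmetic, a few aggregate keywords for structured queries), and the handlers are stubs that only name the backend; a real router might use an LLM or a trained classifier and call actual services.

```python
import re


def classify_query(query):
    """Toy rule-based classifier; the rules are illustrative assumptions."""
    # Pure arithmetic expressions go to deterministic code execution.
    if re.fullmatch(r"[\d\s+\-*/().]+", query):
        return "computation"
    # Aggregate-style questions are usually better served by SQL.
    if re.search(r"\b(how many|total|average|count of)\b", query.lower()):
        return "structured"
    # Everything else is treated as a knowledge question for RAG.
    return "knowledge"


def route(query, handlers):
    """Dispatch the query to the handler registered for its type."""
    kind = classify_query(query)
    return kind, handlers[kind](query)


# Stub handlers: each just reports which backend would run.
handlers = {
    "computation": lambda q: "code_execution",
    "structured": lambda q: "text_to_sql",
    "knowledge": lambda q: "rag",
}

math_route = route("2 * (3 + 4)", handlers)
sql_route = route("How many orders shipped last week?", handlers)
rag_route = route("What is our parental leave policy?", handlers)
```

Keeping the classifier separate from the dispatch table means each backend can be added, swapped, or evaluated on its own without touching the others.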
How do you decide between RAG and fine-tuning?
RAG is best when the knowledge is external, changing, and must be cited. Fine-tuning is best when the model needs to learn a style, format, or domain-specific behavior that does not change frequently. They are complementary: you can fine-tune a model for tone and format while using RAG for factual grounding.