Understanding the spectrum from toy demos to robust retrieval pipelines, including multi-step reasoning and hallucination control.
Naive RAG vs Production RAG
The Spectrum of RAG Maturity
Naive RAG is the canonical demo: embed documents, do a nearest-neighbor search, stuff the top-k chunks into the prompt, and let the model answer. It works surprisingly well for prototypes but falls apart in production because it ignores ranking quality, document permissions, freshness, citation attribution, error handling, and feedback.
Production RAG treats each of those gaps as a first-class concern. The result is not a single prompt trick but a full application architecture with observable, testable, and recoverable behavior.
Production RAG Scorecard
| Layer | Example Check | Why It Matters |
|---|---|---|
| Retrieval | Relevant docs appear and rank well | Without evidence recall, the answer starts from a weak base |
| Grounding | Response cites supporting passages correctly | Prevents fluent unsupported claims |
| Fallback | System abstains when evidence is missing | Safer than forcing confident guesses |
| Operations | Latency and freshness stay inside target | Grounded systems still need product-grade reliability |
Key Architectural Differences
- Ranking: Production systems rerank with multi-signal scoring (relevance, trust, freshness) rather than relying on raw embedding similarity alone.
- Fallback & abstention: A production pipeline must know when not to answer, rather than hallucinating confidently.
- Observability: Every stage — retrieval, reranking, prompt assembly, generation — should emit metrics and traces for debugging.
- Feedback loops: User corrections, thumbs-up/down, and escalation data feed back into retrieval tuning and prompt iteration.
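The observability point can be illustrated with a small tracing helper. This is a minimal sketch, not a substitute for a real tracing library: `PipelineTrace`, its `stage` method, and the recorded field names are all hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager


class PipelineTrace:
    """Collects per-stage timing and metadata for one request (hypothetical helper)."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def stage(self, name, **metadata):
        start = time.perf_counter()
        status = "ok"
        try:
            # Yield the metadata dict so the caller can attach results to it
            yield metadata
        except Exception:
            status = "error"
            raise
        finally:
            # Record the span whether the stage succeeded or failed
            self.spans.append({
                "stage": name,
                "status": status,
                "duration_ms": (time.perf_counter() - start) * 1000,
                **metadata,
            })


trace = PipelineTrace()
with trace.stage("retrieval", top_k=5) as meta:
    meta["num_results"] = 3  # stand-in for a real retriever call
with trace.stage("generation"):
    pass  # stand-in for a real model call
```

Each entry in `trace.spans` can then be shipped to whatever metrics or logging backend the system already uses.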
Python — Multi-signal reranking pattern
    # Production reranking combines multiple signals beyond raw similarity.
    # Weights are tunable per domain; freshness matters more for news,
    # trust matters more for compliance docs.
    def rank_candidate(candidate):
        # Semantic similarity from the embedding model (0-1)
        relevance = candidate["semantic_score"]
        # Trust score based on source authority (0-1)
        trust = candidate["source_trust"]
        # Freshness decay: newer docs score higher (0-1)
        freshness = candidate["freshness_score"]
        # Weighted combination - tune these for your domain
        return 0.65 * relevance + 0.20 * trust + 0.15 * freshness

    # Sort candidates by composite score, best first
    ranked = sorted(candidates, key=rank_candidate, reverse=True)
    # Only pass top-k candidates that clear a minimum threshold
    MIN_SCORE = 0.45
    evidence = [c for c in ranked[:5] if rank_candidate(c) >= MIN_SCORE]
How do you decide the right reranking weights?
What is the biggest risk of naive RAG in production?
How does observability differ between naive and production RAG?
Single-Hop & Multi-Hop Retrieval
When Single-Hop Falls Short
Many real questions require connecting information across documents. "What was the revenue of the company that acquired Startup X in 2024?" requires first identifying the acquirer, then looking up its financials. Single-hop retrieval will likely return documents about Startup X but miss the acquiring company's revenue figures.
Multi-Hop Retrieval Patterns
- Query decomposition: Break the original question into sub-questions, retrieve evidence for each, and synthesize. This is the most common pattern.
- Iterative retrieval: Use intermediate answers to formulate follow-up queries. Each hop refines or extends the evidence set.
- Entity-chain retrieval: Extract entities from first-pass results and use them as queries for subsequent passes.
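Of the three patterns, entity-chain retrieval is the easiest to show end to end. The sketch below assumes a toy in-memory corpus, a keyword-overlap retriever, and a fixed entity list; `CORPUS`, `KNOWN_ENTITIES`, `search`, and `entity_chain_retrieve` are all illustrative stand-ins for a real index and entity extractor.

```python
import re

# Toy corpus and entity list, invented for illustration
CORPUS = {
    "d1": "Startup X was acquired by MegaCorp in 2024.",
    "d2": "MegaCorp annual sales reached 12 billion dollars.",
    "d3": "Startup X builds developer tools.",
}
KNOWN_ENTITIES = ["Startup X", "MegaCorp"]


def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))


def search(query, top_k=2):
    """Toy keyword retriever: rank docs by number of shared tokens."""
    q = tokenize(query)
    scored = []
    for doc_id, text in CORPUS.items():
        overlap = len(q & tokenize(text))
        if overlap:
            scored.append((-overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored)[:top_k]]


def extract_entities(text):
    return [e for e in KNOWN_ENTITIES if e in text]


def entity_chain_retrieve(question, max_hops=2):
    seen_docs = []
    seen_entities = set(extract_entities(question))
    queries = [question]
    for _ in range(max_hops):
        next_queries = []
        for q in queries:
            for doc_id in search(q):
                if doc_id in seen_docs:
                    continue
                seen_docs.append(doc_id)
                # Entities first seen in this hop become the next hop's queries
                for ent in extract_entities(CORPUS[doc_id]):
                    if ent not in seen_entities:
                        seen_entities.add(ent)
                        next_queries.append(ent)
        if not next_queries:
            break
        queries = next_queries
    return seen_docs


question = "What was the revenue of the company that acquired Startup X?"
print(search(question))                 # single hop: ['d1', 'd3']
print(entity_chain_retrieve(question))  # with entity chaining: ['d1', 'd3', 'd2']
```

The single-hop call never reaches the financial document because it shares no words with the question; the second hop finds it through the acquirer's name.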
Trade-offs
| Dimension | Single-Hop | Multi-Hop |
|---|---|---|
| Latency | Fast: one retrieval round | Slower: multiple sequential retrievals |
| Coverage | Limited to directly relevant docs | Can bridge facts across documents |
| Error risk | One point of failure | Errors compound across hops |
| Complexity | Simple pipeline | Requires orchestration and planning |
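The "errors compound" row can be made concrete with a little arithmetic. The sketch below assumes each hop succeeds independently with the same probability, which is a simplification; the 0.9 figure is an illustrative number, not a measurement.

```python
def chain_success_probability(per_hop_accuracy, hops):
    """Probability that every hop succeeds, assuming independent hops."""
    return per_hop_accuracy ** hops


for hops in (1, 2, 3):
    print(hops, round(chain_success_probability(0.9, hops), 3))
# 1 0.9
# 2 0.81
# 3 0.729
```

Under these assumptions, a retriever that is right 90% of the time per hop is right only about 73% of the time over three hops, which is why multi-hop pipelines benefit from validating intermediate results.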
Python — Simple multi-hop retrieval loop
    # Multi-hop retrieval: decompose, retrieve, refine, synthesize.
    # Each hop uses previous results to formulate better queries.
    # `retriever`, `llm`, and `deduplicate` are assumed interfaces, not a real library.
    def multi_hop_retrieve(question, retriever, llm, max_hops=3):
        # Step 1: Ask the LLM to decompose the question, capped at max_hops.
        # Slice once up front so refinements below are seen by later hops.
        sub_questions = llm.decompose(question)[:max_hops]
        all_evidence = []
        for i in range(len(sub_questions)):
            # Step 2: Retrieve evidence for the (possibly refined) sub-question
            docs = retriever.search(sub_questions[i], top_k=3)
            all_evidence.extend(docs)
            # Step 3: Optionally refine next query using current evidence
            if i < len(sub_questions) - 1:
                # Inject current findings into the next sub-question
                context_summary = llm.summarize(docs)
                sub_questions[i + 1] = llm.refine_query(
                    sub_questions[i + 1], context_summary
                )
        # Step 4: Deduplicate and synthesize final answer
        unique_evidence = deduplicate(all_evidence)
        return llm.synthesize(question, unique_evidence)
How do you prevent error compounding across hops?
How do you decide between single-hop and multi-hop at query time?
How does multi-hop retrieval relate to agentic RAG?
Reducing Hallucinations in RAG
The Hallucination Reduction Stack
There is no single switch to eliminate hallucination. Instead, production systems layer multiple defenses:
- Improve retrieval recall: If relevant documents are not retrieved, the model has no grounding material. Better embeddings, hybrid search (dense + sparse), and query expansion help.
- Rerank aggressively: Push the most trustworthy, relevant evidence to the top. Filter out noisy or tangential results (see Topic 1: Naive vs Production RAG).
- Constrain to cited evidence: Instruct the model to only make claims supported by the provided context, and to cite specific passages.
- Require abstention: When evidence is weak or missing, the system should refuse to answer rather than guess. This is a policy decision, not a model capability.
- Separate grounded vs ungrounded generation: Clearly distinguish factual claims (which must be cited) from general reasoning or hedging language.
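The hybrid search mentioned in the first defense needs a way to merge a dense ranking with a sparse one. One common option is reciprocal rank fusion (RRF), sketched below; the document IDs are invented, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["d3", "d1", "d2"]   # e.g. from embedding similarity
sparse_ranking = ["d1", "d4", "d3"]  # e.g. from keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Because RRF uses only rank positions, it avoids having to normalize scores that come from two different scoring scales.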
Where Hallucination Comes From
| Source | Symptom | Fix |
|---|---|---|
| Poor retrieval | Irrelevant context, model fills gaps from weights | Better embeddings, hybrid search, query rewriting |
| Thin context | Not enough evidence, model over-generalizes | Retrieve more chunks, use multi-hop when needed |
| Stale data | Outdated facts presented confidently | Freshness scoring, index refresh (see Topic 5) |
| Noisy context | Contradictory or irrelevant passages confuse generation | Aggressive reranking, deduplication, filtering |
| Weak prompting | No citation requirement, no abstention instruction | Explicit grounding instructions, structured output |
Python — Grounding enforcement with abstention
    # A simple grounding check: if the generated answer cites no
    # retrieved evidence, flag it for abstention.
    # `llm`, `format_evidence`, and `has_citations` are assumed helpers.
    def grounded_generate(question, evidence, llm):
        # Instruct the model to cite evidence and abstain if unsupported
        system_prompt = (
            "Answer the question using ONLY the provided evidence. "
            "Cite evidence by [doc_id]. If the evidence does not support "
            "a confident answer, respond with: 'I don't have enough "
            "information to answer this reliably.'"
        )
        response = llm.generate(
            system=system_prompt,
            user=f"Question: {question}\n\nEvidence:\n{format_evidence(evidence)}"
        )
        # Post-generation check: does the answer cite at least one source?
        if not has_citations(response):
            # No citations found - likely ungrounded, trigger fallback
            return {"answer": None, "status": "abstained", "reason": "no_citations"}
        return {"answer": response, "status": "grounded"}
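The grounding example calls `format_evidence` and `has_citations` without defining them. One possible shape for those helpers is sketched below, assuming evidence is a list of dicts with `id` and `text` keys and that citations look like `[doc_id]`; both assumptions go beyond what the original states.

```python
import re

# Matches bracketed IDs such as [doc1] or [kb-42]; the ID format is an assumption
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_\-]+)\]")


def format_evidence(evidence):
    """Render evidence as one '[id] text' line per document."""
    return "\n".join(f"[{doc['id']}] {doc['text']}" for doc in evidence)


def cited_ids(response):
    return set(CITATION_PATTERN.findall(response))


def has_citations(response, evidence=None):
    """True if the response cites something, and only known docs when evidence is given."""
    ids = cited_ids(response)
    if not ids:
        return False
    if evidence is None:
        return True
    known = {str(doc["id"]) for doc in evidence}
    return ids <= known


evidence = [{"id": "doc1", "text": "Refunds are processed within 5 business days."}]
print(has_citations("Refunds take 5 business days [doc1].", evidence))  # True
print(has_citations("Refunds take 5 business days [doc9].", evidence))  # False
```

Checking cited IDs against the evidence set catches invented document IDs, though it still does not prove that the cited passage supports the claim.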
Can you completely eliminate hallucination in RAG?
How do you measure hallucination rate in production?
How citations, freshness policies, and access control turn retrieval outputs into answers that users and auditors can trust.
Citations & Provenance
Why Provenance Is a Control Mechanism
In enterprise, legal, medical, and compliance-heavy environments, a fluent answer without attribution is a liability. Citations serve multiple roles:
- User trust: Readers can click through and verify claims against the original source.
- Auditability: Compliance teams can review what evidence the system relied on for a given answer.
- Debugging: When an answer is wrong, citations show whether the fault lies in retrieval (wrong source), generation (misinterpreted source), or both.
- Feedback signal: Citation click-through rates indicate whether users find the sources useful.
Citation Implementation Patterns
| Pattern | How It Works | Strength |
|---|---|---|
| Inline references | Model emits [1], [2] tags mapped to source list | Familiar to users, easy to verify |
| Passage-level grounding | Each claim links to the exact passage that supports it | Fine-grained auditability |
| Post-hoc attribution | After generation, a separate model maps claims to evidence | Works with models that do not cite natively |
| Structured output | Response schema includes claims + source fields | Machine-parseable, easy to validate |
Python — Structured citation output
    # Generate an answer with structured citations.
    # Each claim maps to the evidence passage that supports it.
    import json

    def generate_cited_answer(question, evidence_docs, llm):
        # Build a schema-enforced prompt for citation
        schema_instruction = """Respond in JSON:
    {
      "answer": "your full answer text with [1], [2] markers",
      "citations": [
        {"id": 1, "doc_id": "...", "passage": "exact quote", "claim": "what it supports"}
      ],
      "confidence": "high | medium | low",
      "unsupported_claims": ["any claims you could not ground"]
    }"""
        # Format evidence with doc IDs for reference
        evidence_text = "\n".join(
            f"[Doc {d['id']}]: {d['text']}" for d in evidence_docs
        )
        raw = llm.generate(
            system=schema_instruction,
            user=f"Question: {question}\n\nEvidence:\n{evidence_text}"
        )
        # Parse and validate the structured response
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            # Malformed output: return a low-confidence result instead of crashing
            return {"answer": None, "citations": [], "confidence": "low",
                    "unsupported_claims": [], "error": "invalid_json"}
        # Flag answers where no citations were produced
        if not result.get("citations"):
            result["confidence"] = "low"
        return result
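A structured response makes it possible to check that citations are accurate and not merely present. The sketch below assumes the response follows the JSON schema from the example above and that evidence documents are dicts with `id` and `text`; `verify_citations` is a hypothetical helper, and an exact-quote match is only one of several possible checks.

```python
def verify_citations(result, evidence_docs):
    """Return (citation_id, problem) pairs for citations that fail basic checks."""
    docs_by_id = {str(d["id"]): d["text"] for d in evidence_docs}

    def normalize(text):
        # Compare case-insensitively and ignore whitespace differences
        return " ".join(text.lower().split())

    problems = []
    for citation in result.get("citations", []):
        source = docs_by_id.get(str(citation.get("doc_id")))
        passage = normalize(citation.get("passage", ""))
        if source is None:
            problems.append((citation.get("id"), "unknown_doc"))
        elif not passage or passage not in normalize(source):
            problems.append((citation.get("id"), "quote_not_found"))
    return problems


docs = [{"id": "a1", "text": "The warranty lasts 24 months from purchase."}]
result = {
    "citations": [
        {"id": 1, "doc_id": "a1", "passage": "warranty lasts 24 months"},
        {"id": 2, "doc_id": "a1", "passage": "warranty lasts 36 months"},
        {"id": 3, "doc_id": "zz", "passage": "anything"},
    ]
}
print(verify_citations(result, docs))
# [(2, 'quote_not_found'), (3, 'unknown_doc')]
```

A quote match confirms that the passage exists in the source; whether it actually supports the claim still needs a human reviewer or a separate entailment-style check.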
How do you verify that citations are accurate, not just present?
What happens when the same fact appears in multiple sources?
Freshness & Knowledge Updates
The Freshness Problem
Base language models have a training knowledge cutoff. RAG was partly invented to bridge this gap — but the bridge only works if the document index is kept current. A stale index recreates the same problem RAG was meant to solve.
Freshness Controls
- Ingestion schedules: Define how frequently new or updated documents are processed and indexed. Real-time for critical sources, batch for stable references.
- Document versioning: Track which version of a document is in the index. When a document is updated, the old version should be replaced or marked superseded.
- Deletion policies: Remove or tombstone documents that are no longer valid. A retracted policy document should not appear as evidence.
- Freshness scoring: Weight more recent documents higher in ranking (see Topic 1: Naive vs Production RAG) so the system naturally prefers current information.
- Uncertainty communication: If the system cannot guarantee freshness for a request, it should communicate uncertainty or route to a more reliable source instead of inventing confidence.
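Document versioning and deletion policies can be sketched with a small in-memory index. `VersionedIndex` and its methods are hypothetical; a real vector store would expose its own upsert and delete operations, and version numbers are assumed here to be monotonically increasing integers.

```python
class VersionedIndex:
    """Toy index that keeps only the newest version of each document."""

    def __init__(self):
        self._docs = {}  # doc_id -> {"version", "text", "deleted"}

    def upsert(self, doc_id, version, text):
        current = self._docs.get(doc_id)
        if current and current["version"] >= version:
            return False  # ignore stale or duplicate writes
        # The newer version supersedes the old one in place
        self._docs[doc_id] = {"version": version, "text": text, "deleted": False}
        return True

    def tombstone(self, doc_id):
        # Keep the record for auditing but exclude it from retrieval
        if doc_id in self._docs:
            self._docs[doc_id]["deleted"] = True

    def live_docs(self):
        return {i: d["text"] for i, d in self._docs.items() if not d["deleted"]}


index = VersionedIndex()
index.upsert("policy", 1, "Refunds within 14 days.")
index.upsert("policy", 2, "Refunds within 30 days.")
index.upsert("policy", 1, "Refunds within 14 days.")  # late replay, ignored
index.upsert("memo", 1, "Retracted memo.")
index.tombstone("memo")
print(index.live_docs())  # {'policy': 'Refunds within 30 days.'}
```

Rejecting writes with an older version protects the index from out-of-order ingestion jobs, which is one common way stale content reappears.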
Python — Freshness-aware retrieval filter
    # Filter and score documents by freshness before ranking.
    # Ensures stale evidence does not dominate the context window.
    from datetime import datetime

    def freshness_score(doc, max_age_days=90):
        # Calculate how fresh the document is (0 = expired, 1 = brand new).
        # Assumes doc["last_updated"] is a naive datetime in local time.
        age = (datetime.now() - doc["last_updated"]).days
        if age > max_age_days:
            return 0.0  # Beyond max age, treat as stale
        # Clamp so a timestamp slightly in the future cannot score above 1.0
        return 1.0 - (max(age, 0) / max_age_days)

    def filter_stale(docs, min_freshness=0.1):
        # Remove documents below the freshness threshold
        fresh = [d for d in docs if freshness_score(d) >= min_freshness]
        if not fresh:
            # All docs are stale - signal uncertainty upstream
            return [], "all_evidence_stale"
        return fresh, "ok"
How do you handle documents that are old but still authoritative?
What is the cost of real-time index updates vs batch?
Permissions & Access Control
Why Display-Time Filtering Is Insufficient
If access control is only applied after generation, the model has already processed restricted content. It may paraphrase confidential information, use restricted facts in its reasoning chain, or subtly reference protected data. The damage is done at retrieval time, not display time.
Enforcement Patterns
- Permission-aware indexing: Tag every document with ACLs (access control lists) at ingestion time. Retrieval queries include user permission metadata as a filter.
- Pre-retrieval filtering: Before the similarity search runs, apply metadata filters that exclude documents the user cannot access.
- Tenant isolation: In multi-tenant systems, maintain separate indices or strict partition keys per tenant so cross-tenant leakage is prevented by the architecture itself rather than by per-query filtering alone.
- Prompt instructions as defense-in-depth: Telling the model "do not reveal secrets" is a backup, not a primary control. It fails under prompt injection and adversarial queries.
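Tenant isolation differs from ACL filtering in that a query can only ever reach one tenant's partition. The sketch below uses a toy substring search in place of vector similarity; `TenantPartitionedStore` and its methods are hypothetical names.

```python
class TenantPartitionedStore:
    """Toy store with one partition per tenant; search never crosses partitions."""

    def __init__(self):
        self._partitions = {}  # tenant_id -> {doc_id: text}

    def add(self, tenant_id, doc_id, text):
        self._partitions.setdefault(tenant_id, {})[doc_id] = text

    def search(self, tenant_id, query):
        # Only the caller's partition is ever consulted
        partition = self._partitions.get(tenant_id, {})
        needle = query.lower()
        return sorted(
            doc_id for doc_id, text in partition.items() if needle in text.lower()
        )


store = TenantPartitionedStore()
store.add("acme", "a1", "Acme pricing sheet for 2024")
store.add("globex", "g1", "Globex pricing sheet for 2024")
print(store.search("acme", "pricing"))     # ['a1']
print(store.search("unknown", "pricing"))  # []
```

Because the tenant ID selects the partition before any matching happens, a bug in the matching logic cannot surface another tenant's documents.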
Common Mistakes
| Mistake | Risk | Correction |
|---|---|---|
| Filter at display only | Model already saw restricted content | Filter at retrieval time |
| Rely on prompt instructions | Bypassable via prompt injection | Use architectural controls |
| Shared cache across users | User A sees User B's results | Permission-scoped caching |
| Stale ACL metadata | Revoked access still works | Sync ACLs on ingestion refresh |
Python — Permission-scoped retrieval
    # Enforce document permissions at retrieval time.
    # Never let the model see documents the user cannot access.
    import logging

    logger = logging.getLogger(__name__)

    def permission_scoped_search(query, user, vector_store):
        # Get the user's permission groups
        user_groups = user["access_groups"]  # e.g., ["engineering", "public"]
        # Build a metadata filter that restricts to allowed documents.
        # The filter syntax is illustrative; it varies by vector store.
        acl_filter = {
            "access_groups": {"$in": user_groups}
        }
        # The vector search only sees documents matching the ACL filter
        results = vector_store.similarity_search(
            query=query,
            k=10,
            filter=acl_filter  # Pre-retrieval: restricted docs never enter the results
        )
        # Double-check: log any result without ACL metadata as anomalous
        for r in results:
            if not r.metadata.get("access_groups"):
                logger.warning(f"Doc {r.id} missing ACL metadata")
        return results
How do caching layers interact with access control?
What about documents with mixed sensitivity within a single file?
More sophisticated retrieval and optimization strategies for complex tasks and high-traffic production systems.
Agentic RAG
What Makes RAG "Agentic"
The key distinction is autonomy in retrieval strategy. A standard RAG pipeline has a fixed sequence: embed query, search index, rerank, generate. An agentic pipeline adds decision points where the model can:
- Rewrite or decompose the query before retrieval
- Choose among multiple retrieval tools (vector search, keyword search, SQL query, API call)
- Evaluate intermediate results and decide whether to retrieve more
- Route to different generation strategies based on evidence quality
When Agentic RAG Is Worth the Complexity
| Use Case | Why Agentic Helps | Standard RAG Limitation |
|---|---|---|
| Multi-step questions | Decomposes and retrieves iteratively | Single-hop misses linked facts |
| Heterogeneous sources | Chooses the right tool per sub-task | Only queries one index type |
| Ambiguous queries | Rewrites for clarity before retrieval | Retrieves with the original vague query |
| Quality-sensitive domains | Self-evaluates and retries on weak evidence | Returns whatever it finds first |
When to Show Restraint
Not every retrieval workflow needs agent behavior. Agentic RAG adds latency, complexity, and failure paths. For simple factoid lookups, it is engineering overkill. The interview-strength answer is knowing when to reach for it, not defaulting to it.
Python — Agentic retrieval loop with tool selection
    # Agentic RAG: the model decides which tools to use and
    # whether evidence is sufficient before generating.
    def agentic_rag(question, tools, llm, max_steps=5):
        # Available tools: vector_search, sql_query, api_lookup, etc.
        context = []
        plan = llm.plan(question, available_tools=tools)
        for step in plan.steps[:max_steps]:
            # The model chose which tool and query to use
            tool = tools[step.tool_name]
            result = tool.execute(step.query)
            context.append({"tool": step.tool_name, "result": result})
            # After each step, ask: do we have enough evidence?
            sufficiency = llm.evaluate_evidence(question, context)
            if sufficiency.is_sufficient:
                break  # Evidence is good enough, proceed to generation
        # Generate the final answer using all collected evidence
        return llm.generate(question, context)
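The loop above is bounded by `max_steps`, but an agent that re-plans dynamically also needs protection against timeouts and repeated calls. The sketch below shows one way to add that; `StepBudget` and its reason strings are hypothetical, and the default limits are placeholders.

```python
import time


class StepBudget:
    """Guards an agent loop with a step cap, a deadline, and repeat detection."""

    def __init__(self, max_steps=5, max_seconds=10.0):
        self.max_steps = max_steps
        self.deadline = time.monotonic() + max_seconds
        self.steps = 0
        self.seen = set()

    def allow(self, tool_name, query):
        """Return (allowed, reason) for a proposed tool call."""
        if self.steps >= self.max_steps:
            return False, "step_limit"
        if time.monotonic() > self.deadline:
            return False, "timeout"
        key = (tool_name, query.strip().lower())
        if key in self.seen:
            # The same call again will not produce new evidence
            return False, "repeated_call"
        self.seen.add(key)
        self.steps += 1
        return True, "ok"


budget = StepBudget(max_steps=2)
print(budget.allow("vector_search", "acquirer of Startup X"))   # (True, 'ok')
print(budget.allow("vector_search", "Acquirer of Startup X "))  # (False, 'repeated_call')
```

When `allow` returns `False`, the caller can stop retrieving and either answer from the evidence gathered so far or abstain.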
How do you prevent agentic loops from running forever?
How does agentic RAG relate to multi-hop retrieval?
What are the failure modes unique to agentic RAG?
Caching Layers
What to Cache in RAG
| Cache Layer | What Is Cached | Benefit | Invalidation Challenge |
|---|---|---|---|
| Embedding cache | Query → embedding vector | Avoids re-computing embeddings | Low risk for a fixed model; invalidate when the embedding model or its version changes |
| Retrieval cache | Query → top-k documents | Skips vector search entirely | Must invalidate when index changes |
| Reranked set cache | Query → reranked candidates | Skips expensive reranking | Must respect freshness and signal changes |
| Answer cache | Query → final generated answer | Skips generation entirely | Highest risk: stale answers, permission leaks |
Cache Safety Rules
- Freshness TTLs: Every cache entry needs a time-to-live aligned with the source data's update frequency. A news corpus needs minute-level TTLs; a legal reference might tolerate daily TTLs.
- Permission-scoped keys: Cache keys must include the user's permission scope. User A's cached answer must never be served to User B if they have different access levels (see Topic 6: Permissions & Access Control).
- Semantic deduplication: Near-duplicate queries (e.g., "What is RAG?" vs "What does RAG mean?") can share cache entries if a similarity threshold is met, but the threshold needs careful tuning: set it too low and users receive the cached answer to a different question.
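Semantic deduplication can be combined with permission scoping in a single lookup. The sketch below assumes query embeddings are supplied by the caller as plain lists of floats; `SemanticCache`, the linear scan, and the 0.95 threshold are illustrative choices, not recommendations.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached answer when a same-scope query is similar enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (scope, vector, answer)

    def put(self, query_vec, scope, answer):
        self.entries.append((frozenset(scope), tuple(query_vec), answer))

    def get(self, query_vec, scope):
        scope = frozenset(scope)
        best_answer, best_sim = None, self.threshold
        for entry_scope, vec, answer in self.entries:
            if entry_scope != scope:
                continue  # never share entries across permission scopes
            sim = cosine(query_vec, vec)
            if sim >= best_sim:
                best_answer, best_sim = answer, sim
        return best_answer


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], {"public"}, "RAG is retrieval-augmented generation.")
print(cache.get([0.99, 0.05], {"public"}))       # near-duplicate query: cache hit
print(cache.get([1.0, 0.0], {"engineering"}))    # different scope: None
```

A production version would also need TTLs and an approximate nearest-neighbor index in place of the linear scan.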
Python — Permission-scoped answer cache
    # A permission-aware cache for RAG answers.
    # Cache keys include user scope to prevent cross-user leaks.
    import hashlib
    import time

    class ScopedAnswerCache:
        def __init__(self, ttl_seconds=300):
            self.cache = {}
            self.ttl = ttl_seconds

        def _key(self, query, user_scope):
            # Include user permissions in the cache key
            scope_str = ",".join(sorted(user_scope))
            raw = f"{query}||{scope_str}"
            return hashlib.sha256(raw.encode()).hexdigest()

        def get(self, query, user_scope):
            key = self._key(query, user_scope)
            entry = self.cache.get(key)
            if entry and (time.time() - entry["ts"]) < self.ttl:
                return entry["answer"]  # Cache hit, within TTL
            return None  # Cache miss or expired

        def put(self, query, user_scope, answer):
            key = self._key(query, user_scope)
            self.cache[key] = {"answer": answer, "ts": time.time()}
How do you handle cache invalidation when documents are updated?
Is semantic caching worth the complexity?
Evaluating RAG systems rigorously and knowing when RAG is not the right tool for the job.
Evaluation: Offline & Online
Offline Evaluation
Run the RAG pipeline against a curated test set with known-good answers and source documents. Measure:
- Retrieval recall/precision: Are the right documents being retrieved? Are irrelevant ones excluded?
- Groundedness: Does the generated answer only make claims supported by the retrieved evidence?
- Citation correctness: Do citations point to passages that actually support the cited claim?
- Answer quality: Overall accuracy, completeness, and usefulness as judged by human reviewers or LLM-as-judge.
- Abstention correctness: Does the system refuse to answer when it should?
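Two of the offline metrics above reduce to simple counting. The sketch below computes precision and recall at a cutoff `k`, plus abstention accuracy; the function names and the `(should_abstain, did_abstain)` pair format are assumptions made for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved document IDs."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def abstention_accuracy(cases):
    """Fraction of cases where the system abstained exactly when it should have.

    `cases` is a list of (should_abstain, did_abstain) boolean pairs.
    """
    if not cases:
        return 0.0
    return sum(1 for should, did in cases if should == did) / len(cases)


p, r = precision_recall_at_k(["a", "b", "c", "d"], ["a", "c", "e"], k=3)
print(round(p, 3), round(r, 3))  # 0.667 0.667
print(abstention_accuracy([(True, True), (False, False), (True, False), (False, False)]))  # 0.75
```

Reporting precision and recall together matters because raising `k` tends to improve recall while diluting the context with irrelevant passages.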
Online Evaluation
Observe the system in production with real users:
- User satisfaction: Thumbs-up/down, star ratings, NPS-style surveys.
- Task completion: Did the user accomplish what they came for?
- Correction rate: How often do users rephrase and retry?
- Escalation rate: How often do users abandon the AI and contact a human?
- Citation click-through: Are users actually verifying sources?
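Most of these online signals can be derived from a stream of interaction events. The sketch below assumes each event is a dict with a `type` field; the event type names and the choice of "answers shown" as the denominator are illustrative, not a standard.

```python
from collections import Counter


def online_metrics(events):
    """Aggregate per-answer rates from a list of {'type': ...} events."""
    counts = Counter(e["type"] for e in events)
    answers = counts["answer_shown"]
    if answers == 0:
        return {}
    rated = counts["thumbs_up"] + counts["thumbs_down"]
    return {
        # Share of explicit ratings that were positive
        "satisfaction": counts["thumbs_up"] / rated if rated else None,
        "correction_rate": counts["rephrase"] / answers,
        "escalation_rate": counts["escalation"] / answers,
        "citation_click_through": counts["citation_click"] / answers,
    }


events = (
    [{"type": "answer_shown"}] * 4
    + [{"type": "thumbs_up"}] * 2
    + [{"type": "thumbs_down"}]
    + [{"type": "rephrase"}]
    + [{"type": "escalation"}]
    + [{"type": "citation_click"}] * 2
)
metrics = online_metrics(events)
print(metrics["correction_rate"], metrics["citation_click_through"])  # 0.25 0.5
```

Task completion is left out because it usually requires product-specific definitions of what counts as a completed task.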
Error Decomposition
| Error Type | Symptom | Diagnostic |
|---|---|---|
| Retrieval error | Right answer exists but was not retrieved | Check recall metrics on known-good docs |
| Ranking error | Right doc retrieved but ranked too low | Inspect rank positions of ground-truth docs |
| Prompt error | Right docs in context but instructions unclear | A/B test prompt variations with same context |
| Generation error | Right docs, right prompt, but model hallucinates | Groundedness score comparison across models |
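The first rows of the table can be turned into a simple triage function for failed test cases. The sketch below assumes three ID lists are available per case (everything retrieved, what made it into the prompt, and the expected ground truth); `classify_error` is a hypothetical helper, and it cannot separate prompt errors from generation errors without further experiments.

```python
def classify_error(retrieved_ids, context_ids, expected_ids, answer_correct):
    """Attribute a wrong answer to the earliest pipeline stage that failed."""
    if answer_correct:
        return "no_error"
    expected = set(expected_ids)
    if not expected & set(retrieved_ids):
        # The ground-truth document never came back from the index
        return "retrieval_error"
    if not expected & set(context_ids):
        # Retrieved, but ranked too low to reach the prompt
        return "ranking_error"
    # The evidence was in context; the fault is in the prompt or the model
    return "prompt_or_generation_error"


print(classify_error(["d1", "d2"], ["d1"], ["d9"], False))  # retrieval_error
print(classify_error(["d1", "d9"], ["d1"], ["d9"], False))  # ranking_error
print(classify_error(["d9"], ["d9"], ["d9"], False))        # prompt_or_generation_error
```

Counting failures per category across a test set shows which stage deserves tuning effort first.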
Python — Offline evaluation harness
    # A minimal offline evaluation harness for RAG.
    # Measures retrieval recall, groundedness, and answer quality.
    def evaluate_rag(test_cases, rag_pipeline, judge_llm):
        results = []
        for tc in test_cases:
            # Run the RAG pipeline on the test question
            output = rag_pipeline.run(tc["question"])
            # Metric 1: Did we retrieve the expected source documents?
            retrieved_ids = {d["id"] for d in output["retrieved_docs"]}
            expected_ids = set(tc["expected_doc_ids"])
            recall = len(retrieved_ids & expected_ids) / max(len(expected_ids), 1)
            # Metric 2: Is the answer grounded in the evidence?
            groundedness = judge_llm.score_groundedness(
                answer=output["answer"],
                evidence=output["retrieved_docs"]
            )
            # Metric 3: Overall answer quality vs reference
            quality = judge_llm.score_quality(
                answer=output["answer"],
                reference=tc["reference_answer"]
            )
            results.append({
                "question": tc["question"],
                "retrieval_recall": recall,
                "groundedness": groundedness,
                "quality": quality
            })
        return results
How large should the offline test set be?
How do you handle the "LLM-as-judge" reliability problem?
When Not to Use RAG
When RAG Is Not the Answer
| Scenario | Why RAG Is a Poor Fit | Better Alternative |
|---|---|---|
| Deterministic computation | Math, logic, and data transformations do not need retrieval | Code execution, SQL, calculators |
| Structured data queries | Tabular data is better queried via SQL than chunked into vectors | Text-to-SQL, direct API access |
| Stable procedural tasks | Step-by-step workflows do not change per query | Hardcoded logic, decision trees, templates |
| Noisy document corpus | Low-quality sources yield unreliable retrieval | Curate data first, or use structured extraction |
| Simple search suffices | User just needs to find a document, not get an answer | Traditional search with snippets |
The Adequacy Principle
The best solution is the one that is adequate, controllable, and maintainable. RAG adds retrieval infrastructure, embedding pipelines, index management, and generation complexity. If the task can be solved with a simpler architecture that meets the quality bar, prefer simplicity.
Signs You Are Over-Engineering
- The retrieval step returns the same few documents for every query (the corpus is too small or narrow).
- Users never read the AI-generated answer — they always click through to the source document.
- The generation step adds no value beyond what the raw search snippet provides.
- Maintaining the index and pipeline costs more engineering time than the feature saves.
Python — Decision framework: RAG vs alternatives
    # A simple decision function to route tasks to the right architecture.
    # Not every LLM task needs RAG; use the simplest adequate approach.
    def choose_architecture(task):
        # Check if the task is better served by deterministic tools
        if task.requires_computation:
            return "code_execution"  # Math, data transforms, logic
        if task.data_is_structured:
            return "text_to_sql"  # Tabular data, databases
        if task.workflow_is_static:
            return "template_engine"  # Fixed procedures, decision trees
        if task.corpus_too_noisy:
            return "curate_first"  # Clean data before building RAG
        if task.search_suffices:
            return "traditional_search"  # Snippets are enough
        # RAG is appropriate: unstructured knowledge, grounding needed
        if task.needs_grounded_answers and task.has_quality_corpus:
            return "rag"
        return "evaluate_further"  # Edge case - needs human judgment