Ch 15: Text Generation, Decoding & Serving at Scale

Decoding Strategies

How the model selects the next token determines the quality, determinism, and diversity of generated text. These controls are product decisions, not just academic parameters.

Temperature, Top-k & Top-p

Temperature rescales the token probability distribution; top-k limits sampling to the k highest-probability tokens; top-p (nucleus sampling) limits sampling to the smallest set whose cumulative probability exceeds p. Together they control the determinism-diversity trade-off.

🧠Think of temperature as a volume knob for randomness. At 0 the model always picks the loudest note; at high values it lets quieter notes play too. Top-k is a hard cutoff on the orchestra size; top-p is a dynamic cutoff that adjusts based on how confident the model is.

Temperature 1.0

How Temperature Works

Temperature divides the logits (raw scores) before the softmax function. A temperature of 1.0 leaves the distribution unchanged. Lower values sharpen the distribution toward the top token (more deterministic), while higher values flatten it (more random). At T → 0, sampling becomes equivalent to greedy decoding.

The formula is: P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T)).

Top-k Filtering

Top-k restricts sampling to the k most probable tokens and zeroes out all others. This provides a hard ceiling on tail exploration. The downside is that a fixed k may be too restrictive when the model is uncertain (many plausible continuations) or too permissive when the model is confident.

Top-p (Nucleus Sampling)

Top-p dynamically adjusts the candidate set by including tokens in order of probability until their cumulative mass exceeds p. When the model is confident, nucleus sampling uses fewer tokens; when it is uncertain, it naturally expands the set. Holtzman et al. (2020) showed this avoids the "neural text degeneration" caused by greedy or beam-search methods.

Choosing the Right Settings

Use Case	Temperature	Top-k	Top-p
Deterministic extraction (JSON, SQL)	0.0 – 0.2	1 – 5	0.1 – 0.3
Factual Q&A	0.3 – 0.5	10 – 40	0.5 – 0.8
Creative writing / brainstorming	0.7 – 1.2	40 – 100	0.9 – 0.95
Experimental / chaotic generation	1.5+	200+	0.98+

Interview tip: The right decoding setup for deterministic extraction is very different from exploratory brainstorming. Always frame these as product controls that affect quality, determinism, speed, and user experience — not as magical creativity knobs.

✔Key Takeaway: Temperature, top-k, and top-p are decoding controls that trade off determinism against diversity. Nucleus sampling (top-p) is often the strongest default because it adapts dynamically to model confidence.

Python — Sampling controls in a text-generation pipeline

# Demonstrates how temperature, top-k, and top-p are set at the API edge.
# These are product controls: lowering temperature makes output deterministic,
# while top-p dynamically adjusts the sampling nucleus.
from transformers import pipeline

# Initialize a text-generation pipeline with a base model
generator = pipeline("text-generation", model="gpt2")

# Generate with explicit decoding controls
result = generator(
    "Explain retrieval-augmented generation in simple terms:",
    max_new_tokens=120,       # Hard cap on output length (cost + latency control)
    temperature=0.7,           # Moderate randomness - good for explanations
    top_k=40,                  # Hard ceiling: only top 40 tokens considered
    top_p=0.9,                 # Nucleus: include tokens until 90% cumulative prob
    do_sample=True             # Enable sampling (False = greedy decoding)
)

# Output the generated text
print(result[0]["generated_text"])

Follow-up Questions

What happens when you combine top-k and top-p?

When both are set, they act as intersecting filters. Top-k first caps the candidate set to k tokens, then top-p further narrows within that set to the nucleus. In practice this means the stricter constraint dominates. Many frameworks apply top-k first, then top-p on the surviving tokens.

Why not just set temperature to 0 for all production use cases?

Zero temperature collapses sampling to greedy decoding, which can produce repetitive, degenerate text for open-ended tasks. It also makes the model brittle — small prompt changes can flip the entire output because there is no stochastic exploration. For extraction tasks it works well, but for generation tasks some randomness improves quality.

How does repetition penalty relate to these controls?

Repetition penalty is a separate decoding control that downweights tokens that have already appeared in the generated sequence. It addresses a failure mode that temperature and top-p do not directly solve: even with moderate randomness, models can loop. Repetition penalty and frequency penalty are complementary post-logit adjustments.

Beam Search vs Greedy Decoding

Greedy decoding picks the single highest-probability token at each step. Beam search keeps several candidate continuations alive in parallel, giving the system a chance to recover from locally attractive but globally poor choices.

🧠Greedy is like walking through a maze and always turning toward the exit sign — you might hit a dead end. Beam search sends multiple scouts down different corridors and picks whichever team finds the best overall path.

Greedy Decoding

At each step, greedy decoding selects argmax P(token | context). It is the cheapest search strategy — O(V) per step where V is vocabulary size — and produces fully deterministic output. However, it can get trapped by locally optimal but globally suboptimal token sequences, leading to repetitive or incoherent text.

Beam Search

Beam search maintains B candidate sequences (beams) at each step. Each beam is expanded by considering all possible next tokens, then only the top B joint-probability sequences survive. This gives the system a broader view of the search space without exhaustive enumeration.

The trade-off: beam search costs roughly B times more compute per step and still optimizes a narrow probability objective (maximum likelihood), so it is not automatically better for open-ended generation. It works best for constrained tasks like translation, summarization, and structured output.

When to Use Which

Strategy	Best For	Weakness
Greedy	Speed-critical deterministic tasks	Can degenerate on open-ended text
Beam search	Translation, constrained generation	Expensive; tends toward generic output
Sampling (top-p)	Creative, conversational, open-ended	Non-deterministic; needs tuning

Interview frame: Generation is a search problem. Greedy is the cheapest search, beam search is a broader but still limited search, and sampling trades optimality for diversity. See also Topic 1: Temperature, Top-k & Top-p for sampling controls.

✔Key Takeaway: Beam search improves coherence for constrained tasks by exploring multiple paths, but costs more compute and can produce bland outputs. For open-ended generation, sampling methods usually win.

Python — Greedy vs beam search decoding

# Compare greedy decoding with beam search on the same prompt.
# Greedy: fast but can get stuck.
# Beam: explores multiple candidates but costs B times more.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Greedy decoding: always pick the top token
greedy_out = generator(
    "The key advantage of transformers is",
    max_new_tokens=50,
    do_sample=False,         # Disable sampling = greedy
    num_beams=1              # Single beam = greedy search
)

# Beam search: keep 4 candidate sequences alive
beam_out = generator(
    "The key advantage of transformers is",
    max_new_tokens=50,
    do_sample=False,         # Beam search is deterministic
    num_beams=4,             # 4 beams = 4 parallel candidates
    early_stopping=True      # Stop when all beams reach EOS
)

print("Greedy:", greedy_out[0]["generated_text"])
print("Beam:", beam_out[0]["generated_text"])

Follow-up Questions

Why does beam search tend to produce generic text?

Beam search maximizes joint probability, which inherently favors high-frequency, safe token sequences. It concentrates probability mass on common phrasings rather than exploring surprising or creative continuations. This is why creative tasks prefer sampling-based methods that intentionally deviate from the mode.

Can you combine beam search with sampling?

Yes. Stochastic beam search and diverse beam search introduce randomness or diversity penalties into the beam expansion step. Some frameworks let you set do_sample=True with num_beams > 1, which samples within each beam expansion rather than always taking the argmax.

How does beam width affect latency?

Compute scales roughly linearly with beam width (B beams = B forward passes per step). Memory also grows because each beam maintains its own KV cache (see Topic 5: KV Cache). In latency-sensitive serving, beam width above 4-5 is rarely justified.

Streaming Generation

Streaming returns tokens incrementally instead of waiting for the full response. This improves perceived latency, keeps users engaged, and allows interfaces to feel responsive even when total generation time is significant.

🧠Streaming is like reading a letter as it is being typed rather than waiting for the envelope to arrive. The content is the same, but the experience is fundamentally different — and so are the error-handling requirements.

Why Streaming Matters

For a 500-token response at 50 tokens/second, non-streaming means 10 seconds of blank screen before any content appears. Streaming shows the first token in ~100ms. Users perceive streaming responses as 3-5x faster even when total wall-clock time is identical or slightly worse.

Systems Implications

Streaming is not merely a frontend trick. It changes several backend behaviors:

Cancellation: Users can stop generation mid-stream, saving compute on unwanted output
Error handling: Errors mid-generation require graceful partial-output policies
Moderation: Safety filters must work on partial outputs or operate with a token-buffer delay
Connection management: Server-Sent Events (SSE) or WebSockets replace simple request-response

Streaming Protocols

Protocol	Pattern	Use Case
Server-Sent Events (SSE)	Unidirectional server push over HTTP	Most LLM APIs (OpenAI, Anthropic)
WebSockets	Bidirectional persistent connection	Interactive / multi-turn sessions
gRPC streaming	Bidirectional, typed, multiplexed	Internal microservices

✔Key Takeaway: Streaming transforms user experience and backend architecture simultaneously. It is a systems design decision that affects cancellation, error handling, moderation, and connection management — not just perceived speed.

Python — Streaming with the OpenAI-compatible API

# Streaming generation using an OpenAI-compatible client.
# Each chunk arrives as a server-sent event; the client
# processes tokens incrementally instead of waiting.
from openai import OpenAI

client = OpenAI()

# Create a streaming completion request
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain KV caching."}],
    stream=True,               # Enable SSE streaming
    max_tokens=200             # Budget cap for cost control
)

# Process each token as it arrives
for chunk in stream:
    # Each chunk contains a delta with the new token
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # Flush for real-time display

Follow-up Questions

How does streaming interact with safety moderation?

With streaming, the full output is not available before delivery begins. Systems either apply moderation with a token buffer delay (accumulate N tokens, moderate, then release) or use a parallel classifier that can flag and halt the stream mid-generation. The buffer approach adds latency; the parallel approach adds infrastructure complexity.

What happens if an error occurs mid-stream?

The client has already received partial output that cannot be "unsent." Systems handle this with error event types in the SSE protocol, partial-output metadata, and client-side logic to display error states gracefully. The UI must be designed to handle abrupt stream termination without confusing the user.

Does streaming affect batching efficiency?

Yes. In streaming mode, different requests in a batch finish at different times. Continuous batching (iteration-level batching) helps: when one request in the batch finishes, a new request can take its slot immediately, rather than waiting for the entire batch to complete. See Topic 4: Batching & Concurrency.

Serving Infrastructure

The engineering systems that determine how quickly, efficiently, and reliably the model delivers responses under real production load.

Batching & Concurrency

Batching allows multiple requests to share accelerator work, improving hardware utilization. Concurrency strategies help the server manage many active sessions without starving some users. Together they raise throughput substantially, especially for high-volume inference.

🧠Batching is like filling a bus with passengers going the same direction: more efficient than individual taxis, but you wait for the bus to fill. The art is deciding how long to wait before departing.

Batch Size 1

Static vs Continuous Batching

Static batching groups requests that arrive together and processes them as one unit. All requests in the batch must wait for the longest one to finish. Continuous batching (also called iteration-level or in-flight batching) allows new requests to join and completed requests to leave at each decode step, dramatically improving GPU utilization.

The Latency-Throughput Trade-off

Larger batches improve throughput (tokens/second across all requests) but can increase latency for individual requests. The optimal batch size depends on hardware memory, model size, and the service-level objective (SLO). Frameworks like vLLM, TensorRT-LLM, and TGI implement continuous batching to balance this trade-off.

Concurrency Patterns

Pattern	How It Works	Trade-off
Request queue + fixed batch	Accumulate requests, dispatch in fixed groups	Simple but wastes GPU on short requests
Continuous batching	Add/remove requests per decode iteration	Higher utilization, more complex scheduler
Multi-model routing	Route to different model sizes by complexity	Best efficiency, requires routing logic

Interview frame: Strong answers frame serving as a balance between tail latency and cluster utilization. Continuous batching is the state of the art because it decouples individual request completion from batch completion.

✔Key Takeaway: Batching is the primary lever for GPU utilization in LLM serving. Continuous batching solves the head-of-line blocking problem by letting requests join and leave the batch at each decode step.

Python — Simulating batch vs sequential inference timing

# Demonstrates the throughput advantage of batching.
# In real serving, continuous batching handles this dynamically,
# but this shows the core principle: shared GPU work.
import time

def simulate_inference(n_requests, batch_size, per_token_ms=10, tokens=100):
    """Simulate sequential vs batched inference timing."""
    # Sequential: each request waits for the previous one
    sequential_ms = n_requests * tokens * per_token_ms

    # Batched: GPU processes batch_size requests in parallel
    # Batching adds ~20% overhead per step but shares across requests
    n_batches = (n_requests + batch_size - 1) // batch_size
    batched_ms = n_batches * tokens * per_token_ms * 1.2

    print(f"Sequential: {sequential_ms:.0f}ms")
    print(f"Batched (bs={batch_size}): {batched_ms:.0f}ms")
    print(f"Speedup: {sequential_ms/batched_ms:.1f}x")

simulate_inference(n_requests=8, batch_size=8)

Follow-up Questions

What is head-of-line blocking in LLM serving?

In static batching, a short request that finishes early must wait for the longest request in the batch before the GPU can accept new work. This wastes compute on padding. Continuous batching eliminates this by allowing finished requests to exit the batch immediately.

How does sequence length variance affect batching?

High variance in output length makes static batching inefficient because short requests waste GPU cycles waiting for long ones. Continuous batching handles this naturally. Some systems also use speculative decoding to speed up individual requests within the batch.

KV Cache

The KV cache stores previously computed key and value tensors from attention layers so the model does not recompute them for every generated token. It is one of the most important serving optimizations for decoder-only models.

🧠Imagine re-reading an entire book from page one every time you want to write the next word of your summary. The KV cache is your notes — you only read the new page and consult your existing notes for everything before it.

Why the KV Cache Exists

In autoregressive generation, each new token requires attending to all previous tokens. Without caching, the model would recompute the key and value projections for the entire sequence at every step — turning generation from O(n) to O(n^2) in compute per token. The KV cache converts this redundant repeated computation into reusable state.

Memory Cost

KV cache memory grows with: 2 * n_layers * n_heads * head_dim * seq_len * batch_size * precision_bytes. For a 70B parameter model with 80 layers, 64 heads, and 128-dim heads, a single sequence of 4K tokens in FP16 uses roughly 2.5 GB of KV cache alone. This is why memory is the binding constraint for serving, not compute.

KV Cache Management Strategies

Strategy	Mechanism	Benefit
PagedAttention (vLLM)	Non-contiguous memory blocks like OS virtual memory	Near-zero memory waste from fragmentation
Prefix caching	Share KV cache for common prompt prefixes	Amortizes prefill cost across requests
KV cache quantization	Store K/V in INT8 or FP8	~2x memory reduction with minimal quality loss
Sliding window	Only cache the last W tokens	Bounded memory for long sequences

Interview tip: The KV cache is often the bottleneck that determines how many concurrent requests a server can handle. Understanding its memory footprint and management strategies (especially PagedAttention) is critical for senior serving discussions.

✔Key Takeaway: The KV cache eliminates redundant attention computation during autoregressive decoding. Its memory cost is the primary constraint on serving concurrency, making cache management (PagedAttention, prefix caching, quantization) essential knowledge.

Python — Estimating KV cache memory

# Calculate KV cache memory for a given model and sequence configuration.
# This helps engineers understand why memory (not compute) is the binding
# constraint for LLM serving concurrency.

def kv_cache_memory_gb(
    n_layers: int,        # Number of transformer layers
    n_heads: int,         # Number of attention heads (for KV)
    head_dim: int,        # Dimension per head
    seq_len: int,         # Total sequence length (prompt + output)
    batch_size: int = 1,  # Concurrent requests
    precision: int = 2   # Bytes per value (2=FP16, 1=INT8)
) -> float:
    # Factor of 2 for keys AND values
    total_bytes = 2 * n_layers * n_heads * head_dim * seq_len * batch_size * precision
    return total_bytes / (1024 ** 3)  # Convert to GB

# Example: Llama-70B style model
mem = kv_cache_memory_gb(
    n_layers=80, n_heads=8,   # GQA: 8 KV heads (not 64 query heads)
    head_dim=128, seq_len=4096,
    batch_size=32, precision=2  # FP16
)
print(f"KV cache for 32 concurrent 4K requests: {mem:.1f} GB")

Follow-up Questions

What is PagedAttention and why did it change LLM serving?

PagedAttention (introduced by vLLM) manages KV cache like virtual memory pages in an operating system. Instead of pre-allocating contiguous memory for the maximum sequence length, it allocates small blocks on demand and maps them via a page table. This eliminates memory fragmentation and increases the number of concurrent requests a server can handle by 2-4x.

How does prefix caching work?

When multiple requests share the same system prompt or few-shot prefix, prefix caching computes the KV cache for that prefix once and shares it across all requests. This amortizes the expensive prefill phase. It is especially valuable for systems that use long system prompts or retrieval-augmented contexts with common prefixes.

How does grouped-query attention (GQA) affect KV cache size?

GQA uses fewer KV heads than query heads (e.g., 8 KV heads vs 64 query heads in Llama-2 70B). Since KV cache size scales with n_kv_heads, GQA reduces cache memory by the ratio of KV heads to query heads — often 4-8x — with minimal quality loss. This is why most modern large models use GQA.

Quantization

Quantization reduces the numerical precision of model weights (and sometimes activations), lowering memory usage and often improving inference speed. It makes large models fit on cheaper hardware or increases concurrent request capacity.

🧠Quantization is like compressing a high-resolution photo to JPEG. You lose some fine detail, but the image is still recognizable and takes far less storage. The question is always how much compression you can tolerate before the picture degrades noticeably.

Precision Levels

Precision	Bits	Memory (70B model)	Quality Impact
FP32	32	~280 GB	Baseline (training)
FP16 / BF16	16	~140 GB	Negligible loss
INT8 (W8A8)	8	~70 GB	Minimal loss, well-validated
INT4 (W4A16)	4	~35 GB	Noticeable on some tasks
GPTQ / AWQ / GGUF	2-4	~17-35 GB	Task-dependent; must validate

Weight-Only vs Weight-Activation Quantization

Weight-only quantization (W4A16, W8A16) stores weights in low precision but computes in higher precision. This saves memory and bandwidth but does not accelerate compute. Weight-activation quantization (W8A8, W4A4) quantizes both, enabling hardware-accelerated low-precision math (INT8 tensor cores), which improves both memory and speed.

When Quantization Hurts

Aggressive quantization can degrade output on tasks requiring precise numerical reasoning, long-form coherence, or rare-knowledge recall. The degradation is often non-uniform — some layers are more sensitive than others. Mixed-precision approaches (quantize most layers aggressively, keep sensitive layers at higher precision) help mitigate this.

Interview frame: Quantization is an engineering trade-off. It often delivers major efficiency gains, but quality must be validated per-task because aggressive compression can degrade fidelity. See Topic 7: Throughput vs Latency for how quantization fits the broader serving optimization landscape.

✔Key Takeaway: Quantization trades precision for efficiency. INT8 is generally safe; INT4 requires validation. The real question is not "how much can we compress?" but "does compressed quality meet the product bar for this task?"

Python — Loading a quantized model with bitsandbytes

# Load a model in 4-bit quantization using bitsandbytes.
# This reduces a 70B model from ~140 GB (FP16) to ~35 GB,
# making it fit on a single A100 80GB GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # Enable 4-bit weight quantization
    bnb_4bit_quant_type="nf4",         # NormalFloat4: optimized for normal distributions
    bnb_4bit_compute_dtype="bfloat16", # Compute in BF16 for stability
    bnb_4bit_use_double_quant=True    # Quantize the quantization constants too
)

# Load model with quantization applied
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=quant_config,
    device_map="auto"                  # Auto-distribute across available GPUs
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
print(f"Model loaded. Memory: {model.get_memory_footprint() / 1e9:.1f} GB")

Follow-up Questions

What is the difference between GPTQ, AWQ, and GGUF?

GPTQ uses layer-wise second-order optimization to find the best low-bit representation. AWQ (Activation-Aware Weight Quantization) protects salient weights based on activation magnitudes. GGUF is a file format (used by llama.cpp) that supports various quantization levels with CPU-optimized kernels. They represent different points in the accuracy-speed-compatibility space.

How do you validate that quantization has not degraded quality?

Run the quantized model against a task-specific evaluation suite and compare against the full-precision baseline. Check perplexity on held-out text, accuracy on structured tasks, and do human evaluation on open-ended generation. Pay special attention to long-tail and edge-case performance, where quantization damage shows up first.

Throughput vs Latency

Throughput measures how much total work the system does over time; latency measures how long one request takes. Optimizing one can hurt the other. The right objective depends on whether the product is a batch pipeline or an interactive copilot.

🧠A highway's throughput is cars per hour. Latency is how long your commute takes. Adding more lanes (GPUs) can help both, but packing the highway with more cars (bigger batches) improves throughput while making your individual trip slower due to merging delays.

Load (requests/sec) 5 rps

Key Metrics

Metric	Definition	Optimized By
Time to First Token (TTFT)	Latency before first token appears	Fast prefill, streaming, model routing
Inter-Token Latency (ITL)	Time between consecutive tokens	KV cache, quantization, small batch size
Total Generation Time	Wall-clock time for full response	All of the above + output length
Throughput (tokens/sec)	Total tokens generated across all requests	Large batches, continuous batching, GPUs
P99 Latency	Worst-case latency for 99th percentile	Queue management, load shedding, SLOs

Product-Driven Optimization

The right optimization target depends on the product:

Interactive copilots: Optimize TTFT and ITL. Users feel every millisecond of delay.
Batch processing: Optimize throughput. Individual request latency matters less than cost-per-token.
Mixed workloads: Use tiered serving with priority queues — interactive requests get low-latency fast paths, batch requests fill remaining GPU capacity.

Interview frame: A highly batched cluster may be throughput-efficient while still feeling slow to users. Strong answers separate TTFT, ITL, and total generation time as distinct SLO dimensions. See Topic 4: Batching & Concurrency for the batching mechanisms behind this trade-off.

✔Key Takeaway: Throughput and latency are often in tension. The right balance is a product decision: interactive systems prioritize TTFT and tail latency; batch systems prioritize cost-per-token throughput.

Python — Measuring TTFT and throughput

# Measure Time to First Token (TTFT) and throughput for a streaming request.
# In production, these metrics feed SLO dashboards and autoscaling rules.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
first_token_time = None
token_count = 0

# Stream a completion and measure timing
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List 10 LLM serving optimizations."}],
    stream=True, max_tokens=300
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # Capture TTFT
        token_count += 1

end = time.perf_counter()

# Report metrics
ttft = (first_token_time - start) * 1000  # ms
total = end - start                          # seconds
throughput = token_count / total              # tokens/sec
print(f"TTFT: {ttft:.0f}ms | Tokens: {token_count} | Throughput: {throughput:.1f} tok/s")

Follow-up Questions

What is the difference between P50 and P99 latency, and why does P99 matter more?

P50 is the median — half of requests are faster. P99 is the 99th percentile — only 1% of requests are slower. P99 matters more because it captures the worst experience your users actually have. In LLM serving, P99 spikes often come from long prompts, cache misses, or queue buildup during load surges.

How does model routing reduce latency for simple queries?

Tiered model routing sends simple queries (classification, short answers) to smaller, faster models and reserves expensive frontier models for complex reasoning tasks. This reduces average latency and cost without sacrificing quality on hard queries. The routing decision itself can be made by a lightweight classifier or heuristic.

Long-Context Serving

Long-context serving is difficult because memory and attention costs grow with sequence length, prompts are more expensive to process, and the risk of context dilution rises. Longer context is a capability, not a default operating mode.

🧠A longer context window is like a bigger desk. You can spread out more documents, but finding the right paper takes longer, the desk costs more, and most tasks do not need every document open at once. Good engineering means knowing when to use the big desk versus filing most papers away.

Why Cost Grows with Context

Attention is O(n^2) in sequence length for standard self-attention, and O(n) for linear/efficient variants. Even with optimizations, longer sequences mean:

KV cache memory: Grows linearly with sequence length (see Topic 5: KV Cache)
Prefill compute: Processing a 128K prompt costs ~32x more than a 4K prompt
Throughput reduction: Fewer concurrent long-context requests fit in GPU memory
Context dilution: Models can lose focus when the relevant information is buried in a large context

Mitigation Strategies

Strategy	Approach	Trade-off
Retrieval instead of stuffing	Pull only relevant chunks into context	Requires good retrieval pipeline
Context compression	Summarize or prune less-relevant passages	Risk of losing critical details
Sliding window attention	Attend only to recent W tokens	Loses long-range dependencies
Hierarchical attention	Coarse attention over distant tokens, fine over recent	Architectural complexity
Chunked prefill	Process long prompts in chunks to limit peak memory	Adds prefill latency

Interview frame: A strong answer connects this back to retrieval and context design. Just because a model supports 128K tokens does not mean you should use all 128K on every request. The cost-quality trade-off must be deliberate.

✔Key Takeaway: Long context is expensive in memory, compute, and quality (context dilution). Good engineering uses retrieval and compression to keep context tight, reserving long windows for tasks that genuinely need them.

Python — Estimating cost scaling with context length

# Show how serving cost and memory scale with context length.
# This helps engineers justify retrieval over context-stuffing.

def context_cost_scaling(base_len=4096, target_lengths=None):
    """Compare relative cost at different context lengths."""
    if target_lengths is None:
        target_lengths = [4096, 8192, 16384, 32768, 65536, 131072]

    print(f"{'Context':>10} {'KV Memory':>12} {'Prefill':>10} {'Max Batch':>10}")
    print("-" * 46)

    for length in target_lengths:
        ratio = length / base_len
        # KV cache scales linearly
        kv_ratio = ratio
        # Prefill compute scales quadratically (standard attention)
        prefill_ratio = ratio ** 2
        # Max concurrent batch inversely proportional to KV size
        batch_ratio = 1.0 / ratio

        print(f"{length:>10,} {kv_ratio:>11.1f}x {prefill_ratio:>9.1f}x {batch_ratio:>9.2f}x")

context_cost_scaling()

Follow-up Questions

What is context dilution and how do you detect it?

Context dilution occurs when relevant information is buried among irrelevant content, causing the model to lose focus or miss critical details. Detection involves running needle-in-a-haystack tests at various positions and lengths. Mitigation includes placing critical information near the beginning or end of the context, and using retrieval to keep context lean.

Should you always use the maximum context window available?

No. Using the full context window on every request wastes compute and money, reduces serving concurrency, and risks context dilution. The best practice is to use retrieval to extract relevant passages and only use long context when the task genuinely requires it (e.g., long-document summarization, multi-document QA).

Production Systems

The safety controls, architectural patterns, and operational concerns that turn model access into a reliable product service.

Safety & Moderation in Generation

Safety controls can be applied before, during, and after generation. Reliable systems layer multiple controls — input screening, tool permission gating, constrained decoding, output moderation — because no single mechanism catches everything.

🧠Think of safety as airport security: ID check at the door (input filtering), metal detectors during boarding (constrained decoding), and air marshals on the plane (output moderation). Each layer catches different threats, and removing any one layer creates blind spots.

The Safety Pipeline

Safety is a pipeline property, not a single classifier. Controls operate at three stages:

Pre-generation: Input screening, prompt injection detection, rate limiting, authentication
During generation: Constrained decoding (blocklists, grammar constraints), tool permission gating
Post-generation: Output classifiers, PII detection, compliance checks, human review triggers

Safety Control Types

Control	Stage	Mechanism	Limitation
Content classifier	Pre + Post	ML model flags harmful content	False positives/negatives
Blocklist / allowlist	During	Prevent specific tokens or phrases	Brittle; easy to circumvent
Refusal training	During	Model trained to decline unsafe requests	Can be jailbroken
Tool permissions	Pre	Gate which tools the model can invoke	Requires access control design
Human escalation	Post	Route uncertain outputs to human review	Latency; does not scale to all requests

Streaming Complicates Safety

When responses are streamed, the full output is not available before delivery begins. This means post-generation classifiers either operate with a buffer delay (accumulate tokens before release) or run in parallel and can halt the stream. Both approaches have trade-offs: buffering adds latency; parallel classification adds infrastructure cost. See Topic 3: Streaming Generation for streaming architecture details.

✔Key Takeaway: Safety is a layered pipeline, not a single filter. Reliable systems combine input screening, constrained decoding, output moderation, and human escalation because no single mechanism catches all failure modes.

Python — Layered safety pipeline

# A simplified layered safety pipeline showing pre/post generation checks.
# In production, each layer is a separate service with its own SLO.

class SafetyPipeline:
    def __init__(self, input_classifier, output_classifier, pii_detector):
        self.input_clf = input_classifier    # Pre-generation safety check
        self.output_clf = output_classifier  # Post-generation content filter
        self.pii_det = pii_detector          # PII redaction layer

    def check_input(self, prompt: str) -> dict:
        """Screen input before generation begins."""
        result = self.input_clf.classify(prompt)
        if result.risk_score > 0.85:
            return {"action": "block", "reason": result.category}
        if result.risk_score > 0.6:
            return {"action": "escalate", "reason": result.category}
        return {"action": "allow"}

    def check_output(self, response: str) -> str:
        """Post-process output: moderate content and redact PII."""
        # Step 1: Content safety
        if self.output_clf.is_unsafe(response):
            return "I cannot provide that information."  # Refusal
        # Step 2: PII redaction
        response = self.pii_det.redact(response)
        return response

Follow-up Questions

How do you handle false positives in safety classifiers?

False positives degrade user experience by blocking legitimate requests. Mitigations include multi-stage classification (cheap fast filter + expensive precise filter), confidence thresholds with escalation paths, user feedback loops, and regular retraining on false-positive examples. The goal is high recall (catch harmful content) without unacceptable precision loss.

What is the difference between safety and alignment?

Safety is about preventing harmful outputs — it is a guardrail. Alignment is about making the model follow human intent and values — it is a training objective (RLHF, Constitutional AI). Safety controls are deployed at serving time; alignment is built into the model during training. Both are necessary: alignment reduces the frequency of unsafe outputs, and safety controls catch what alignment misses.

How do you test safety controls end-to-end?

Use red-teaming: adversarial testers try to bypass safety controls using prompt injection, jailbreaks, and edge cases. Combine this with automated adversarial test suites, regression sets of known-bad inputs, and monitoring of production refusal rates. Safety is not a one-time check — it requires continuous evaluation as models and attacks evolve.

Scalable LLM Service Design

A scalable generation service includes request routing, authentication, prompt assembly, retrieval, model serving, streaming delivery, logging, caching, safety checks, and evaluation feedback loops. It may also use tiered model routing so simple tasks go to cheaper models.

🧠An LLM service is like a restaurant kitchen: the host (router) seats guests, the prep cook (prompt assembly) prepares ingredients, the chef (model) cooks, the waiter (streaming) serves courses as they are ready, and the manager (observability) tracks everything. No single role makes the restaurant work — the system does.

Service Architecture Components

Component	Role	Key Design Decision
API Gateway	Auth, rate limiting, routing	How to route between model tiers
Prompt Assembly	Template + context + retrieval	Caching strategy for common prefixes
Model Serving	Inference, batching, KV cache	Continuous batching framework choice
Streaming Layer	Token delivery via SSE/WebSocket	Buffer size for safety moderation
Safety Pipeline	Input/output moderation	Latency budget for safety checks
Observability	Logging, tracing, metrics	What to log without leaking user data
Evaluation	Quality regression detection	Automated eval vs human eval balance
Cache Layer	Semantic or exact-match caching	Cache invalidation on model updates

Tiered Model Routing

Not every request needs a frontier model. A routing layer can classify incoming requests by complexity and dispatch them to appropriate model tiers:

Tier 1 (fast/cheap): Small models for classification, extraction, simple Q&A
Tier 2 (balanced): Mid-size models for standard generation tasks
Tier 3 (frontier): Large models for complex reasoning, code generation, creative work

Operational Concerns

The best interview answer is architectural. Show that you can connect model behavior to queues, autoscaling, observability, guardrails, and rollback strategy:

Autoscaling: Scale GPU instances based on queue depth and latency SLOs
Rollback: Version model deployments so you can revert if quality regresses
A/B testing: Route a percentage of traffic to new model versions before full rollout
Cost attribution: Track per-team, per-feature token usage for capacity planning

Interview frame: "A good generation service balances decoding quality, streaming responsiveness, and infrastructure efficiency because users feel all three at once." This is the one-sentence summary interviewers want to hear.

✔Key Takeaway: A production LLM service is not just a model endpoint — it is a full system encompassing routing, assembly, serving, streaming, safety, caching, observability, and evaluation. Treat these as one operational path, not separate concerns.

Python — Service architecture skeleton with tiered routing

# Skeleton of a tiered LLM service with routing, safety, and observability.
# Each component would be a separate service in production.

class LLMService:
    def __init__(self, router, safety, models, logger):
        self.router = router       # Classifies request complexity
        self.safety = safety       # Layered safety pipeline
        self.models = models       # Dict of model tiers: {"fast": ..., "frontier": ...}
        self.logger = logger       # Structured logging + metrics

    async def handle_request(self, request):
        """Full request lifecycle: route -> assemble -> generate -> moderate."""

        # 1. Input safety check
        safety_result = self.safety.check_input(request.prompt)
        if safety_result["action"] == "block":
            return RefusalResponse(safety_result["reason"])

        # 2. Route to appropriate model tier
        tier = self.router.classify(request)  # "fast", "balanced", "frontier"
        model = self.models[tier]

        # 3. Assemble prompt (system prompt + retrieved context + user input)
        full_prompt = assemble_prompt(request, tier)

        # 4. Generate with streaming
        async for token in model.generate_stream(full_prompt):
            yield token  # Stream tokens to client

        # 5. Post-generation: log metrics, check output safety
        self.logger.log(tier=tier, tokens=token_count, latency_ms=elapsed)

        # 6. Output moderation (in production, runs in parallel with streaming)
        final_output = self.safety.check_output(full_response)

Follow-up Questions

How do you handle model version rollbacks in production?

Use versioned deployments where each model version runs as a separate deployment behind a load balancer. Rollback means shifting traffic back to the previous version. Combine with automated regression tests that run on every new deployment and alerts that trigger if quality metrics drop below SLO thresholds.

What metrics should an LLM service dashboard show?

Key metrics: TTFT (P50/P99), inter-token latency, throughput (tokens/sec), error rate, queue depth, GPU utilization, cache hit rate, safety refusal rate, and cost per request. Dashboards should segment by model tier, request type, and customer to identify tier-specific bottlenecks and cost attribution.

How does semantic caching work for LLM services?

Semantic caching embeds incoming queries and checks for semantically similar past queries. If a match is found above a similarity threshold, the cached response is returned without calling the model. This works well for repeated informational queries but poorly for personalized or context-dependent requests. Cache invalidation on model updates is a key design challenge.