How the model selects the next token determines the quality, determinism, and diversity of generated text. These controls are product decisions, not just academic parameters.
Temperature, Top-k & Top-p
How Temperature Works
Temperature divides the logits (raw scores) before the softmax function. A temperature of 1.0 leaves the distribution unchanged. Lower values sharpen the distribution toward the top token (more deterministic), while higher values flatten it (more random). At T → 0, sampling becomes equivalent to greedy decoding.
The formula is: P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T)).
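The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation; the function name `softmax_with_temperature` and the example logits are made up for the demonstration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return P(token_i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the first token; higher ones spread it across all three.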
Top-k Filtering
Top-k restricts sampling to the k most probable tokens, assigns zero probability to all others, and renormalizes the survivors. This provides a hard ceiling on tail exploration. The downside is that a fixed k may be too restrictive when the model is uncertain (many plausible continuations) or too permissive when the model is confident.
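As a rough sketch of the filtering step, assuming the probabilities have already been computed: the helper name `top_k_filter` and the example distribution are illustrative only.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print(top_k_filter(probs, k=2))
```

The candidate set has the same size whether the distribution is peaked or flat, which is the fixed-k limitation described above.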
Top-p (Nucleus Sampling)
Top-p dynamically adjusts the candidate set by including tokens in order of probability until their cumulative mass exceeds p. When the model is confident, nucleus sampling uses fewer tokens; when it is uncertain, it naturally expands the set. Holtzman et al. (2020) showed this avoids the "neural text degeneration" caused by greedy or beam-search methods.
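The adaptive behavior can be seen in a small sketch. It assumes a precomputed probability list and stops once the cumulative mass reaches p (implementations differ on whether the boundary is inclusive); `top_p_filter` and both example distributions are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

confident = [0.9, 0.05, 0.03, 0.02]  # one dominant token
uncertain = [0.3, 0.3, 0.2, 0.2]     # many plausible continuations

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    nucleus = top_p_filter(dist, p=0.9)
    print(name, sum(1 for q in nucleus if q > 0), "tokens kept")
```

With the same p, the nucleus shrinks for the confident distribution and grows for the uncertain one.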
Choosing the Right Settings
| Use Case | Temperature | Top-k | Top-p |
|---|---|---|---|
| Deterministic extraction (JSON, SQL) | 0.0 – 0.2 | 1 – 5 | 0.1 – 0.3 |
| Factual Q&A | 0.3 – 0.5 | 10 – 40 | 0.5 – 0.8 |
| Creative writing / brainstorming | 0.7 – 1.2 | 40 – 100 | 0.9 – 0.95 |
| Experimental / chaotic generation | 1.5+ | 200+ | 0.98+ |
Python — Sampling controls in a text-generation pipeline
# Demonstrates how temperature, top-k, and top-p are set at the API edge.
# These are product controls: lowering temperature makes output deterministic,
# while top-p dynamically adjusts the sampling nucleus.
from transformers import pipeline

# Initialize a text-generation pipeline with a base model
generator = pipeline("text-generation", model="gpt2")

# Generate with explicit decoding controls
result = generator(
    "Explain retrieval-augmented generation in simple terms:",
    max_new_tokens=120,  # Hard cap on output length (cost + latency control)
    temperature=0.7,     # Moderate randomness - good for explanations
    top_k=40,            # Hard ceiling: only top 40 tokens considered
    top_p=0.9,           # Nucleus: include tokens until 90% cumulative prob
    do_sample=True       # Enable sampling (False = greedy decoding)
)

# Output the generated text
print(result[0]["generated_text"])
What happens when you combine top-k and top-p?
Why not just set temperature to 0 for all production use cases?
How does repetition penalty relate to these controls?
Beam Search vs Greedy Decoding
Greedy Decoding
At each step, greedy decoding selects argmax P(token | context). It is the cheapest search strategy — O(V) per step where V is vocabulary size — and produces fully deterministic output. However, it can get trapped by locally optimal but globally suboptimal token sequences, leading to repetitive or incoherent text.
Beam Search
Beam search maintains B candidate sequences (beams) at each step. Each beam is expanded by considering all possible next tokens, then only the top B joint-probability sequences survive. This gives the system a broader view of the search space without exhaustive enumeration.
The trade-off: beam search costs roughly B times more compute per step and still optimizes a narrow probability objective (maximum likelihood), so it is not automatically better for open-ended generation. It works best for constrained tasks like translation, summarization, and structured output.
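To make the difference from greedy decoding concrete, here is a toy sketch. `TOY_MODEL` is an invented lookup table in which the next-token distribution depends only on the previous token; a real model would be queried instead. Setting `beam_width=1` reduces the procedure to greedy decoding.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the last token
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
}

def beam_search(model, start, steps, beam_width):
    """Return the surviving beams as (tokens, cumulative log-probability)."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in model.get(tokens[-1], {}).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        # Keep only the beam_width best joint-probability sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(TOY_MODEL, "<s>", steps=2, beam_width=1)[0]
beam = beam_search(TOY_MODEL, "<s>", steps=2, beam_width=2)[0]
print("greedy:", greedy[0], round(math.exp(greedy[1]), 2))
print("beam:  ", beam[0], round(math.exp(beam[1]), 2))
```

Greedy commits to "a" because it is locally best, while the wider beam keeps "b" alive long enough to find the higher-probability sequence through it.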
When to Use Which
| Strategy | Best For | Weakness |
|---|---|---|
| Greedy | Speed-critical deterministic tasks | Can degenerate on open-ended text |
| Beam search | Translation, constrained generation | Expensive; tends toward generic output |
| Sampling (top-p) | Creative, conversational, open-ended | Non-deterministic; needs tuning |
Python — Greedy vs beam search decoding
# Compare greedy decoding with beam search on the same prompt.
# Greedy: fast but can get stuck.
# Beam: explores multiple candidates but costs B times more.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Greedy decoding: always pick the top token
greedy_out = generator(
    "The key advantage of transformers is",
    max_new_tokens=50,
    do_sample=False,  # Disable sampling = greedy
    num_beams=1       # Single beam = greedy search
)

# Beam search: keep 4 candidate sequences alive
beam_out = generator(
    "The key advantage of transformers is",
    max_new_tokens=50,
    do_sample=False,     # Beam search is deterministic
    num_beams=4,         # 4 beams = 4 parallel candidates
    early_stopping=True  # Stop as soon as num_beams finished candidates exist
)

print("Greedy:", greedy_out[0]["generated_text"])
print("Beam:", beam_out[0]["generated_text"])
Why does beam search tend to produce generic text?
Can you combine beam search with sampling?
Yes: set do_sample=True with num_beams > 1, which samples within each beam expansion rather than always taking the argmax.
How does beam width affect latency?
Streaming Generation
Why Streaming Matters
For a 500-token response at 50 tokens/second, non-streaming means 10 seconds of blank screen before any content appears. Streaming shows the first token in ~100ms. Users perceive streaming responses as 3-5x faster even when total wall-clock time is identical or slightly worse.
Systems Implications
Streaming is not merely a frontend trick. It changes several backend behaviors:
- Cancellation: Users can stop generation mid-stream, saving compute on unwanted output
- Error handling: Errors mid-generation require graceful partial-output policies
- Moderation: Safety filters must work on partial outputs or operate with a token-buffer delay
- Connection management: Server-Sent Events (SSE) or WebSockets replace simple request-response
Streaming Protocols
| Protocol | Pattern | Use Case |
|---|---|---|
| Server-Sent Events (SSE) | Unidirectional server push over HTTP | Most LLM APIs (OpenAI, Anthropic) |
| WebSockets | Bidirectional persistent connection | Interactive / multi-turn sessions |
| gRPC streaming | Bidirectional, typed, multiplexed | Internal microservices |
Python — Streaming with the OpenAI-compatible API
# Streaming generation using an OpenAI-compatible client.
# Each chunk arrives as a server-sent event; the client
# processes tokens incrementally instead of waiting.
from openai import OpenAI

client = OpenAI()

# Create a streaming completion request
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain KV caching."}],
    stream=True,     # Enable SSE streaming
    max_tokens=200   # Budget cap for cost control
)

# Process each chunk as it arrives
for chunk in stream:
    # Some chunks carry no choices (for example, metadata-only chunks); skip them
    if not chunk.choices:
        continue
    # Each delta holds the newly generated text, which may span one or more tokens
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # Flush for real-time display
How does streaming interact with safety moderation?
What happens if an error occurs mid-stream?
Does streaming affect batching efficiency?
The engineering systems that determine how quickly, efficiently, and reliably the model delivers responses under real production load.
Batching & Concurrency
Static vs Continuous Batching
Static batching groups requests that arrive together and processes them as one unit. All requests in the batch must wait for the longest one to finish. Continuous batching (also called iteration-level or in-flight batching) allows new requests to join and completed requests to leave at each decode step, dramatically improving GPU utilization.
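A small simulation illustrates the difference. It is a sketch under simplifying assumptions: every decode step costs the same regardless of batch occupancy, all requests are already queued, and each request is represented only by its output length. The function names and workload are invented for the example.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot; waiting requests join at the next step."""
    waiting = list(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decode step produces one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical workload: mostly short outputs with two long ones
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In the static case the short requests sit idle in their slots until the long one finishes; in the continuous case those slots are refilled immediately.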
The Latency-Throughput Trade-off
Larger batches improve throughput (tokens/second across all requests) but can increase latency for individual requests. The optimal batch size depends on hardware memory, model size, and the service-level objective (SLO). Frameworks like vLLM, TensorRT-LLM, and TGI implement continuous batching to balance this trade-off.
Concurrency Patterns
| Pattern | How It Works | Trade-off |
|---|---|---|
| Request queue + fixed batch | Accumulate requests, dispatch in fixed groups | Simple but wastes GPU on short requests |
| Continuous batching | Add/remove requests per decode iteration | Higher utilization, more complex scheduler |
| Multi-model routing | Route to different model sizes by complexity | Best efficiency, requires routing logic |
Python — Simulating batch vs sequential inference timing
# Demonstrates the throughput advantage of batching.
# In real serving, continuous batching handles this dynamically,
# but this shows the core principle: shared GPU work.
import time

def simulate_inference(n_requests, batch_size, per_token_ms=10, tokens=100):
    """Simulate sequential vs batched inference timing."""
    # Sequential: each request waits for the previous one
    sequential_ms = n_requests * tokens * per_token_ms

    # Batched: GPU processes batch_size requests in parallel
    # Batching adds ~20% overhead per step but shares across requests
    n_batches = (n_requests + batch_size - 1) // batch_size
    batched_ms = n_batches * tokens * per_token_ms * 1.2

    print(f"Sequential: {sequential_ms:.0f}ms")
    print(f"Batched (bs={batch_size}): {batched_ms:.0f}ms")
    print(f"Speedup: {sequential_ms/batched_ms:.1f}x")

simulate_inference(n_requests=8, batch_size=8)
What is head-of-line blocking in LLM serving?
How does sequence length variance affect batching?
KV Cache
Why the KV Cache Exists
In autoregressive generation, each new token requires attending to all previous tokens. Without caching, the model would recompute the key and value projections for the entire sequence at every step — turning generation from O(n) to O(n^2) in compute per token. The KV cache converts this redundant repeated computation into reusable state.
Memory Cost
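KV cache memory grows with: 2 * n_layers * n_heads * head_dim * seq_len * batch_size * precision_bytes, where n_heads counts the key/value heads. For a 70B parameter model with 80 layers, 64 heads, and 128-dim heads, a single sequence of 4K tokens in FP16 uses roughly 10 GB of KV cache alone when every head keeps its own keys and values (2 * 80 * 64 * 128 * 4096 * 2 bytes). With grouped-query attention and only 8 KV heads, the same sequence needs about 1.25 GB. This is why memory, not compute, is often the binding constraint for serving.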
KV Cache Management Strategies
| Strategy | Mechanism | Benefit |
|---|---|---|
| PagedAttention (vLLM) | Non-contiguous memory blocks like OS virtual memory | Near-zero memory waste from fragmentation |
| Prefix caching | Share KV cache for common prompt prefixes | Amortizes prefill cost across requests |
| KV cache quantization | Store K/V in INT8 or FP8 | ~2x memory reduction with minimal quality loss |
| Sliding window | Only cache the last W tokens | Bounded memory for long sequences |
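The sliding-window row is the simplest of these to sketch. The class below is a hypothetical illustration using strings as stand-ins for key/value entries; a real cache would hold per-layer, per-head tensors.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps key/value entries for only the most recent `window` tokens."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically once full
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4)
for position in range(10):
    cache.append(f"k{position}", f"v{position}")  # stand-in K/V entries

print(len(cache), list(cache.keys))
```

Memory stays bounded at W entries no matter how long the sequence grows, at the cost of forgetting everything older than the window.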
Python — Estimating KV cache memory
# Calculate KV cache memory for a given model and sequence configuration.
# This helps engineers understand why memory (not compute) is the binding
# constraint for LLM serving concurrency.
def kv_cache_memory_gb(
    n_layers: int,        # Number of transformer layers
    n_heads: int,         # Number of attention heads (for KV)
    head_dim: int,        # Dimension per head
    seq_len: int,         # Total sequence length (prompt + output)
    batch_size: int = 1,  # Concurrent requests
    precision: int = 2    # Bytes per value (2=FP16, 1=INT8)
) -> float:
    # Factor of 2 for keys AND values
    total_bytes = 2 * n_layers * n_heads * head_dim * seq_len * batch_size * precision
    return total_bytes / (1024 ** 3)  # Convert to GB

# Example: Llama-70B style model
mem = kv_cache_memory_gb(
    n_layers=80,
    n_heads=8,       # GQA: 8 KV heads (not 64 query heads)
    head_dim=128,
    seq_len=4096,
    batch_size=32,
    precision=2      # FP16
)
print(f"KV cache for 32 concurrent 4K requests: {mem:.1f} GB")
What is PagedAttention and why did it change LLM serving?
How does prefix caching work?
How does grouped-query attention (GQA) affect KV cache size?
Quantization
Precision Levels
| Precision | Bits | Memory (70B model) | Quality Impact |
|---|---|---|---|
| FP32 | 32 | ~280 GB | Baseline (training) |
| FP16 / BF16 | 16 | ~140 GB | Negligible loss |
| INT8 (W8A8) | 8 | ~70 GB | Minimal loss, well-validated |
| INT4 (W4A16) | 4 | ~35 GB | Noticeable on some tasks |
| GPTQ / AWQ / GGUF | 2-4 | ~17-35 GB | Task-dependent; must validate |
Weight-Only vs Weight-Activation Quantization
Weight-only quantization (W4A16, W8A16) stores weights in low precision but computes in higher precision. This saves memory and bandwidth but does not accelerate the arithmetic itself, although decoding, which is typically memory-bandwidth bound, often still gets faster. Weight-activation quantization (W8A8, W4A4) quantizes both, enabling hardware-accelerated low-precision math (INT8 tensor cores), which improves both memory and speed.
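The weight-only idea can be illustrated with symmetric per-tensor INT8 quantization in plain Python. This is a simplified sketch, not how GPTQ, AWQ, or bitsandbytes work internally; the function names and the sample weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover higher-precision weights before the higher-precision matmul."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.0]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

worst = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 codes:", q)
print(f"scale={scale:.6f}, worst-case error={worst:.6f}")
```

Storage drops to one byte per weight plus one scale, and the rounding error per weight is bounded by half the scale.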
When Quantization Hurts
Aggressive quantization can degrade output on tasks requiring precise numerical reasoning, long-form coherence, or rare-knowledge recall. The degradation is often non-uniform — some layers are more sensitive than others. Mixed-precision approaches (quantize most layers aggressively, keep sensitive layers at higher precision) help mitigate this.
Python — Loading a quantized model with bitsandbytes
# Load a model in 4-bit quantization using bitsandbytes.
# This reduces a 70B model from ~140 GB (FP16) to ~35 GB,
# making it fit on a single A100 80GB GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # Enable 4-bit weight quantization
    bnb_4bit_quant_type="nf4",           # NormalFloat4: optimized for normal distributions
    bnb_4bit_compute_dtype="bfloat16",   # Compute in BF16 for stability
    bnb_4bit_use_double_quant=True       # Quantize the quantization constants too
)

# Load model with quantization applied
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=quant_config,
    device_map="auto"  # Auto-distribute across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

print(f"Model loaded. Memory: {model.get_memory_footprint() / 1e9:.1f} GB")
What is the difference between GPTQ, AWQ, and GGUF?
How do you validate that quantization has not degraded quality?
Throughput vs Latency
Key Metrics
| Metric | Definition | Optimized By |
|---|---|---|
| Time to First Token (TTFT) | Latency before first token appears | Fast prefill, streaming, model routing |
| Inter-Token Latency (ITL) | Time between consecutive tokens | KV cache, quantization, small batch size |
| Total Generation Time | Wall-clock time for full response | All of the above + output length |
| Throughput (tokens/sec) | Tokens generated per second across all requests | Large batches, continuous batching, more GPUs |
| P99 Latency | Latency that 99% of requests complete within (tail latency) | Queue management, load shedding, SLOs |
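Percentile metrics such as P99 are easy to compute from raw samples. The sketch below uses the nearest-rank definition (other definitions interpolate) and an invented latency sample in which 2% of requests are slow.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 98 fast requests and 2 slow ones, in milliseconds
latencies_ms = [120] * 98 + [900, 2500]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50}ms  P99={p99}ms  mean={sum(latencies_ms) / len(latencies_ms):.0f}ms")
```

The median hides the slow requests entirely, which is why tail percentiles are tracked separately from averages.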
Product-Driven Optimization
The right optimization target depends on the product:
- Interactive copilots: Optimize TTFT and ITL. Users feel every millisecond of delay.
- Batch processing: Optimize throughput. Individual request latency matters less than cost-per-token.
- Mixed workloads: Use tiered serving with priority queues — interactive requests get low-latency fast paths, batch requests fill remaining GPU capacity.
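The priority-queue idea for mixed workloads might look like the following sketch. The class name, the two priority levels, and the request IDs are all hypothetical; a production scheduler would also handle preemption, aging, and admission control.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number = served first

class PriorityRequestQueue:
    """Interactive requests are dequeued before batch requests; FIFO within a tier."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker preserves arrival order

    def submit(self, request_id, priority):
        heapq.heappush(self._heap, (priority, next(self._order), request_id))

    def next_batch(self, max_size):
        batch = []
        while self._heap and len(batch) < max_size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

queue = PriorityRequestQueue()
queue.submit("b1", BATCH)
queue.submit("i1", INTERACTIVE)
queue.submit("b2", BATCH)
queue.submit("i2", INTERACTIVE)

first = queue.next_batch(max_size=3)
second = queue.next_batch(max_size=3)
print(first, second)
```

Interactive requests take the available slots first, and batch requests fill whatever capacity is left.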
Python — Measuring TTFT and throughput
# Measure Time to First Token (TTFT) and throughput for a streaming request.
# In production, these metrics feed SLO dashboards and autoscaling rules.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_time = None
token_count = 0

# Stream a completion and measure timing
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List 10 LLM serving optimizations."}],
    stream=True,
    max_tokens=300
)

for chunk in stream:
    # Some chunks carry no choices or no text; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # Capture TTFT
        token_count += 1  # Counts content chunks, an approximation of tokens

end = time.perf_counter()

# Report metrics (guard against a stream that produced no text)
if first_token_time is None:
    print("No content received")
else:
    ttft = (first_token_time - start) * 1000  # ms
    total = end - start                       # seconds
    throughput = token_count / total          # tokens/sec
    print(f"TTFT: {ttft:.0f}ms | Tokens: {token_count} | Throughput: {throughput:.1f} tok/s")
What is the difference between P50 and P99 latency, and why does P99 matter more?
How does model routing reduce latency for simple queries?
Long-Context Serving
Why Cost Grows with Context
Attention is O(n^2) in sequence length for standard self-attention, and O(n) for linear/efficient variants. Even with optimizations, longer sequences mean:
- KV cache memory: Grows linearly with sequence length (see Topic 5: KV Cache)
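- Prefill compute: Processing a 128K prompt costs at least ~32x more than a 4K prompt. The per-token linear layers scale by 32x, while the quadratic attention term scales by up to ~1024x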
- Throughput reduction: Fewer concurrent long-context requests fit in GPU memory
- Context dilution: Models can lose focus when the relevant information is buried in a large context
Mitigation Strategies
| Strategy | Approach | Trade-off |
|---|---|---|
| Retrieval instead of stuffing | Pull only relevant chunks into context | Requires good retrieval pipeline |
| Context compression | Summarize or prune less-relevant passages | Risk of losing critical details |
| Sliding window attention | Attend only to recent W tokens | Loses long-range dependencies |
| Hierarchical attention | Coarse attention over distant tokens, fine over recent | Architectural complexity |
| Chunked prefill | Process long prompts in chunks to limit peak memory | Adds prefill latency |
Python — Estimating cost scaling with context length
# Show how serving cost and memory scale with context length.
# This helps engineers justify retrieval over context-stuffing.
def context_cost_scaling(base_len=4096, target_lengths=None):
    """Compare relative cost at different context lengths."""
    if target_lengths is None:
        target_lengths = [4096, 8192, 16384, 32768, 65536, 131072]

    print(f"{'Context':>10} {'KV Memory':>12} {'Prefill':>10} {'Max Batch':>10}")
    print("-" * 46)

    for length in target_lengths:
        ratio = length / base_len
        # KV cache scales linearly
        kv_ratio = ratio
        # Prefill compute scales quadratically (standard attention)
        prefill_ratio = ratio ** 2
        # Max concurrent batch inversely proportional to KV size
        batch_ratio = 1.0 / ratio
        print(f"{length:>10,} {kv_ratio:>11.1f}x {prefill_ratio:>9.1f}x {batch_ratio:>9.2f}x")

context_cost_scaling()
What is context dilution and how do you detect it?
Should you always use the maximum context window available?
The safety controls, architectural patterns, and operational concerns that turn model access into a reliable product service.
Safety & Moderation in Generation
The Safety Pipeline
Safety is a pipeline property, not a single classifier. Controls operate at three stages:
- Pre-generation: Input screening, prompt injection detection, rate limiting, authentication
- During generation: Constrained decoding (blocklists, grammar constraints), tool permission gating
- Post-generation: Output classifiers, PII detection, compliance checks, human review triggers
Safety Control Types
| Control | Stage | Mechanism | Limitation |
|---|---|---|---|
| Content classifier | Pre + Post | ML model flags harmful content | False positives/negatives |
| Blocklist / allowlist | During | Prevent specific tokens or phrases | Brittle; easy to circumvent |
| Refusal training | During | Model trained to decline unsafe requests | Can be jailbroken |
| Tool permissions | Pre | Gate which tools the model can invoke | Requires access control design |
| Human escalation | Post | Route uncertain outputs to human review | Latency; does not scale to all requests |
Streaming Complicates Safety
When responses are streamed, the full output is not available before delivery begins. This means post-generation classifiers either operate with a buffer delay (accumulate tokens before release) or run in parallel and can halt the stream. Both approaches have trade-offs: buffering adds latency; parallel classification adds infrastructure cost. See Topic 3: Streaming Generation for streaming architecture details.
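The buffer-delay option can be sketched as a generator that holds back the most recent tokens until they have been checked. Everything here is illustrative: `moderated_stream`, the `is_unsafe` callback, the stop marker, and the toy token list are assumptions, and a real classifier would be a model call rather than a substring test.

```python
from collections import deque

STOP_MARKER = "[stopped by moderation]"  # hypothetical placeholder message

def moderated_stream(tokens, is_unsafe, buffer_size=3):
    """Release tokens `buffer_size` steps late; halt if the text turns unsafe."""
    buffer = deque()
    released = []
    for tok in tokens:
        buffer.append(tok)
        # Check everything generated so far, including tokens not yet released
        if is_unsafe("".join(released) + "".join(buffer)):
            yield STOP_MARKER
            return
        if len(buffer) > buffer_size:
            out = buffer.popleft()
            released.append(out)
            yield out
    # Generation finished cleanly: flush what is still held back
    while buffer:
        yield buffer.popleft()

tokens = ["The ", "secret ", "code ", "is ", "1234"]
flagged = list(moderated_stream(tokens, lambda text: "1234" in text, buffer_size=2))
clean = list(moderated_stream(tokens, lambda text: False, buffer_size=2))
print(flagged)
print(clean)
```

A larger buffer gives the classifier more context before anything reaches the user, but delays the first visible token by the same amount.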
Python — Layered safety pipeline
# A simplified layered safety pipeline showing pre/post generation checks.
# In production, each layer is a separate service with its own SLO.
class SafetyPipeline:
    def __init__(self, input_classifier, output_classifier, pii_detector):
        self.input_clf = input_classifier    # Pre-generation safety check
        self.output_clf = output_classifier  # Post-generation content filter
        self.pii_det = pii_detector          # PII redaction layer

    def check_input(self, prompt: str) -> dict:
        """Screen input before generation begins."""
        result = self.input_clf.classify(prompt)
        if result.risk_score > 0.85:
            return {"action": "block", "reason": result.category}
        if result.risk_score > 0.6:
            return {"action": "escalate", "reason": result.category}
        return {"action": "allow"}

    def check_output(self, response: str) -> str:
        """Post-process output: moderate content and redact PII."""
        # Step 1: Content safety
        if self.output_clf.is_unsafe(response):
            return "I cannot provide that information."  # Refusal
        # Step 2: PII redaction
        response = self.pii_det.redact(response)
        return response
How do you handle false positives in safety classifiers?
What is the difference between safety and alignment?
How do you test safety controls end-to-end?
Scalable LLM Service Design
Service Architecture Components
| Component | Role | Key Design Decision |
|---|---|---|
| API Gateway | Auth, rate limiting, routing | How to route between model tiers |
| Prompt Assembly | Template + context + retrieval | Caching strategy for common prefixes |
| Model Serving | Inference, batching, KV cache | Continuous batching framework choice |
| Streaming Layer | Token delivery via SSE/WebSocket | Buffer size for safety moderation |
| Safety Pipeline | Input/output moderation | Latency budget for safety checks |
| Observability | Logging, tracing, metrics | What to log without leaking user data |
| Evaluation | Quality regression detection | Automated eval vs human eval balance |
| Cache Layer | Semantic or exact-match caching | Cache invalidation on model updates |
Tiered Model Routing
Not every request needs a frontier model. A routing layer can classify incoming requests by complexity and dispatch them to appropriate model tiers:
- Tier 1 (fast/cheap): Small models for classification, extraction, simple Q&A
- Tier 2 (balanced): Mid-size models for standard generation tasks
- Tier 3 (frontier): Large models for complex reasoning, code generation, creative work
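A routing layer can start as a simple heuristic before being replaced by a learned classifier. The sketch below is purely illustrative: the keyword list, word-count thresholds, and function name are assumptions, not a recommended policy.

```python
# Hypothetical hints that a request needs heavier reasoning
REASONING_HINTS = ("prove", "debug", "refactor", "step by step")

def route_request(prompt, max_fast_words=30, max_balanced_words=300):
    """Return a tier name: "fast", "balanced", or "frontier"."""
    words = len(prompt.split())
    lowered = prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS) or words > max_balanced_words:
        return "frontier"
    if words <= max_fast_words:
        return "fast"
    return "balanced"

print(route_request("Classify this review as positive or negative: great phone"))
print(route_request("Debug this race condition in my scheduler"))
print(route_request("word " * 100))
```

In practice the routing decision is logged alongside quality metrics so that misrouted requests can be detected and the thresholds tuned.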
Operational Concerns
The best interview answer is architectural. Show that you can connect model behavior to queues, autoscaling, observability, guardrails, and rollback strategy:
- Autoscaling: Scale GPU instances based on queue depth and latency SLOs
- Rollback: Version model deployments so you can revert if quality regresses
- A/B testing: Route a percentage of traffic to new model versions before full rollout
- Cost attribution: Track per-team, per-feature token usage for capacity planning
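The A/B testing item is often implemented with deterministic hashing so that a given user always sees the same model version. This is a minimal sketch under that assumption; the function name, variant labels, and user ID format are hypothetical.

```python
import hashlib

def assign_variant(user_id, rollout_percent):
    """Map a user to "candidate" or "stable" consistently across requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "candidate" if bucket < rollout_percent else "stable"

# Roughly rollout_percent of users land on the new model version
users = [f"user-{i}" for i in range(10000)]
share = sum(assign_variant(u, 10) == "candidate" for u in users) / len(users)
print(f"candidate share at 10% rollout: {share:.1%}")
```

Because assignment depends only on the user ID, rolling back is a matter of setting the percentage to zero, and raising it never moves an existing candidate user back to the stable version.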
Python — Service architecture skeleton with tiered routing
# Skeleton of a tiered LLM service with routing, safety, and observability.
# Each component would be a separate service in production.
# RefusalResponse and assemble_prompt are placeholders defined elsewhere.
import time

class LLMService:
    def __init__(self, router, safety, models, logger):
        self.router = router  # Classifies request complexity
        self.safety = safety  # Layered safety pipeline
        self.models = models  # Dict of model tiers: {"fast": ..., "frontier": ...}
        self.logger = logger  # Structured logging + metrics

    async def handle_request(self, request):
        """Full request lifecycle: route -> assemble -> generate -> moderate."""
        start = time.perf_counter()

        # 1. Input safety check
        safety_result = self.safety.check_input(request.prompt)
        if safety_result["action"] == "block":
            # An async generator cannot return a value, so yield the refusal and stop
            yield RefusalResponse(safety_result["reason"])
            return

        # 2. Route to appropriate model tier
        tier = self.router.classify(request)  # "fast", "balanced", "frontier"
        model = self.models[tier]

        # 3. Assemble prompt (system prompt + retrieved context + user input)
        full_prompt = assemble_prompt(request, tier)

        # 4. Generate with streaming, keeping the text for post-generation checks
        tokens = []
        async for token in model.generate_stream(full_prompt):
            tokens.append(token)
            yield token  # Stream tokens to client
        full_response = "".join(tokens)  # Assumes tokens are strings

        # 5. Post-generation: log metrics
        elapsed = (time.perf_counter() - start) * 1000
        self.logger.log(tier=tier, tokens=len(tokens), latency_ms=elapsed)

        # 6. Output moderation (in production, runs in parallel with streaming)
        final_output = self.safety.check_output(full_response)