How the transformer works, from the high-level breakthrough to the mechanics of attention, QKV, multi-head processing, and positional information.
Why the Transformer Was a Major Breakthrough
What Changed
Before transformers, sequence models like RNNs and LSTMs processed tokens step by step. Token 50 had to wait for tokens 1-49 to be processed first. This created two problems:
- Training was slow — sequential processing could not fully exploit modern GPU parallelism.
- Long-range dependencies were hard — information had to survive being passed through many sequential steps, leading to vanishing gradients and forgetfulness.
The Transformer Solution
The transformer (Vaswani et al., 2017) replaced recurrence entirely with self-attention (Topic 2), allowing every token to attend directly to every other token in the sequence. This meant:
- Parallelism: All tokens can be processed simultaneously during training, mapping efficiently to GPU hardware.
- Direct connections: Token 50 can directly "look at" token 1 without information passing through 49 intermediate steps.
- Scalability: The architecture scales to billions of parameters and billions of training tokens.
Interview Signal
Do not answer only "because of attention." The strongest answer is that the architecture scaled better, trained faster on modern accelerators, and generalized into the foundation of current LLMs. The breakthrough was both algorithmic (attention) and operational (parallelism).
Python Example
# Conceptual comparison: RNN vs Transformer processing
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def update(hidden_state, token):
    """Toy recurrent update; a placeholder, not a real RNN cell."""
    return np.tanh(hidden_state + token)

# RNN: sequential, each step depends on the previous one
def rnn_process(tokens, hidden_state):
    """Process tokens one at a time, sequentially."""
    outputs = []
    for token in tokens:
        hidden_state = update(hidden_state, token)  # can't parallelize
        outputs.append(hidden_state)
    return outputs  # O(n) sequential steps

# Transformer: parallel, all tokens processed at once
def transformer_process(tokens):
    """Process all tokens simultaneously via attention."""
    # All-to-all attention computed in one matrix operation
    scores = tokens @ tokens.T  # every token attends to every other
    weights = softmax(scores)
    output = weights @ tokens  # parallel, GPU-friendly
    return output  # O(1) parallel steps (O(n^2) compute)
Are RNNs completely obsolete?
What did "Attention Is All You Need" actually propose?
How did the transformer influence non-NLP fields?
What Is Self-Attention?
How It Works
For each token, self-attention computes a weighted sum of the representations of all tokens in the sequence, its own included. The weights are determined by how relevant each token is to the current one. This is computed via the Query-Key-Value mechanism (Topic 3).
The result is that each token's representation becomes context-sensitive. The same word "bank" produces a different internal representation depending on whether it appears near "river" or "loan."
Why Self-Attention Is Powerful
- Ambiguity resolution: The word "bank" gets disambiguated by attending to context tokens.
- Co-reference: A pronoun "she" can attend to the name it refers to, even if several sentences earlier.
- Long-distance dependencies: A negation ("not") can influence a word many positions later through direct attention.
The Cost
Standard self-attention compares every token to every other token, which means computation and memory grow quadratically with sequence length. A 4,096-token sequence requires 4,096 × 4,096 ≈ 16.8 million pairwise comparisons per attention head. See Topic 8: Scaling & Long Sequences for how this shapes system design.
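To make the quadratic growth concrete, the sketch below counts pairwise scores and estimates the memory needed to hold one head's full score matrix. The helper name `attention_matrix_cost` and the 2-bytes-per-score (fp16) figure are assumptions chosen for illustration.

```python
def attention_matrix_cost(seq_len, bytes_per_score=2):
    """Return (pairwise scores, MiB) for one head's full n x n score matrix.

    bytes_per_score=2 assumes fp16 storage (an illustrative assumption).
    """
    n_scores = seq_len * seq_len
    mib = n_scores * bytes_per_score / 2**20
    return n_scores, mib

for n in (1024, 4096, 16384):
    n_scores, mib = attention_matrix_cost(n)
    print(f"{n:>6} tokens: {n_scores:>12,} scores, {mib:>7.1f} MiB per head")
```

Quadrupling the sequence length multiplies both numbers by 16, which is why optimized kernels avoid materializing the full matrix.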
Python Example
import numpy as np
# Simplified self-attention for 4 tokens, 3-dim embeddings
tokens = np.array([
[1.0, 0.0, 0.5], # "The"
[0.0, 1.0, 0.8], # "bank"
[0.5, 0.3, 1.0], # "near"
[0.2, 0.9, 0.1], # "river"
])
# Compute raw attention scores (dot product of all pairs)
scores = tokens @ tokens.T # shape: [4, 4]
# Apply softmax to get attention weights (rows sum to 1)
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
weights = softmax(scores)
# Weighted sum: each token's new representation
output = weights @ tokens
print("Attention weights for 'bank':", weights[1].round(3))
Is self-attention the same as cross-attention?
Can attention weights be used to interpret model decisions?
What is the softmax temperature in attention?
1/sqrt(d_k) in scaled dot-product attention acts like a temperature. Without it, dot products grow large in high dimensions, causing softmax to produce extremely peaked distributions that are hard to train. The scale keeps gradients flowing.
Query, Key, and Value Vectors
The Matching Step
Each token's embedding is linearly projected into three separate vectors through learned weight matrices. The query (Q) represents what information the token is searching for. The key (K) represents what the token advertises as its identity or content. The dot product Q · K^T computes a relevance score between every pair of tokens.
The Content Aggregation Step
After softmax normalizes each row of scores into a probability distribution, the resulting attention weights are used to compute a weighted sum of the value (V) vectors. This weighted sum becomes the token's new, context-enriched representation. The key insight: Q-K matching decides who matters, and V decides what information gets forwarded.
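The two steps above can be sketched end to end, starting from token embeddings instead of hand-written Q, K, and V. This is a minimal sketch: the sizes are hypothetical, and `W_q`, `W_k`, `W_v` are random stand-ins for weight matrices that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 4  # hypothetical toy sizes

X = rng.normal(size=(n_tokens, d_model))  # token embeddings
# Random stand-ins for the learned projection matrices (assumption)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # each: [n_tokens, d_k]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(Q @ K.T / np.sqrt(d_k))  # matching step: who matters
output = weights @ V                       # aggregation step: what is forwarded

print("weights shape:", weights.shape)
print("output shape:", output.shape)
```

Because Q and K come from different projections, the score matrix is generally not symmetric: token A can attend strongly to token B without the reverse being true.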
Scaled Dot-Product
The scores are divided by √d_k (the square root of the key dimension) before softmax. Without this scaling, dot products in high dimensions produce very large values, causing softmax to output near-one-hot distributions with vanishing gradients.
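A quick numerical check of this argument, assuming query and key components are independent with unit variance (an idealization, not a property of trained models). The helper `dot_product_std` is hypothetical and only measures how the spread of raw dot products grows with d_k.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_std(d_k, n_pairs=2000):
    """Empirical std of q.k over random unit-variance vector pairs."""
    q = rng.normal(size=(n_pairs, d_k))
    k = rng.normal(size=(n_pairs, d_k))
    return (q * k).sum(axis=-1).std()

for d_k in (4, 64, 1024):
    raw_std = dot_product_std(d_k)
    print(f"d_k={d_k:>4}  raw std={raw_std:6.2f}  after /sqrt(d_k)={raw_std / np.sqrt(d_k):.2f}")

# Effect on softmax: the same score pattern at small and large magnitude
scores = np.array([1.0, 0.5, 0.0, -0.5])
print("moderate scores:", softmax(scores).round(3))
print("32x larger     :", softmax(scores * 32).round(3))
```

The raw spread grows roughly like √d_k, and dividing by √d_k brings it back near 1, which keeps softmax away from the saturated regime shown in the last line.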
Python Example
import numpy as np
d_k = 4 # key dimension
# Simulated Q, K, V for 3 tokens
Q = np.array([[1,0,1,0], [0,1,0,1], [1,1,0,0]], dtype=float)
K = np.array([[1,1,0,0], [0,0,1,1], [1,0,0,1]], dtype=float)
V = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]], dtype=float)
# Step 1: Compute scaled dot-product scores
scores = Q @ K.T / np.sqrt(d_k) # scale by sqrt(d_k)
# Step 2: Softmax to get attention weights
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
weights = softmax(scores)
# Step 3: Weighted sum of values
output = weights @ V
print("Attention output:", output)
Why are Q, K, V separate projections instead of using the raw embeddings?
What is the computational cost of the QKV computation?
Why Transformers Use Multiple Attention Heads
Specialization Through Subspaces
Instead of computing one large attention operation, multi-head attention splits the model dimension into h smaller subspaces (heads). Each head independently computes Q, K, V attention (Topic 3) in its own learned projection space, allowing different heads to capture different types of relationships.
Empirically, researchers have observed that different heads specialize in different linguistic phenomena:
- Syntax heads: Attend to grammatically related tokens (subject-verb, modifier-noun).
- Positional heads: Attend to fixed relative positions (previous token, next token).
- Semantic heads: Attend to thematically related content across long distances.
- Copy heads: Attend to tokens that should be repeated in the output.
How Heads Combine
After each head produces its output, all head outputs are concatenated and passed through a linear projection. This allows the model to learn how to combine the different types of attention into a single, rich representation.
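The split, per-head attention, concatenation, and final projection can be sketched as follows. The sizes are hypothetical, and `W_q`, `W_k`, `W_v`, `W_o` are random stand-ins for learned weights, so only the shapes and data flow are meaningful here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 3, 8, 2  # hypothetical toy sizes
d_head = d_model // n_heads

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(n_tokens, d_model))
# Random stand-ins for learned weights (assumption)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(m):
    """[n_tokens, d_model] -> [n_heads, n_tokens, d_head]"""
    return m.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # [n_heads, n_tokens, n_tokens]
heads = softmax(scores) @ V                          # [n_heads, n_tokens, d_head]

concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)  # concatenate heads
output = concat @ W_o                                         # final linear projection
print("per-head output:", heads.shape, "-> combined:", output.shape)
```

The output projection is what lets information from different heads mix; without it, each head's result would stay confined to its own slice of the representation.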
Practical Considerations
| Model | Heads | d_model | d_head |
|---|---|---|---|
| GPT-2 Small | 12 | 768 | 64 |
| GPT-3 175B | 96 | 12,288 | 128 |
| LLaMA 70B | 64 | 8,192 | 128 |
More heads increase flexibility, but each individual head has a smaller subspace. The total computation remains similar because the split is along the model dimension.
Python Example
import numpy as np
# Simplified multi-head attention (2 heads, d_model=4)
d_model = 4
n_heads = 2
d_head = d_model // n_heads # each head gets 2 dims
x = np.random.randn(3, d_model) # 3 tokens
# Each head processes a slice of the dimension
head_outputs = []
for h in range(n_heads):
    start = h * d_head
    end = start + d_head
    x_h = x[:, start:end]  # slice for this head
    # Simplified attention within this head's subspace
    scores = x_h @ x_h.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    out_h = weights @ x_h
    head_outputs.append(out_h)
    print(f"Head {h} output shape:", out_h.shape)
# Concatenate all head outputs
concat = np.concatenate(head_outputs, axis=-1)
print("Concatenated shape:", concat.shape) # back to d_model
Can you prune attention heads without hurting quality?
What is Grouped-Query Attention (GQA)?
Why Transformers Need Positional Encodings
The Permutation Problem
Without position information, self-attention (Topic 2) is permutation-equivariant: reordering the input tokens only reorders the outputs, and each token's representation stays exactly the same. The sequence "dog bites man" therefore produces the same pairwise attention scores as "man bites dog". This is a fundamental problem because word order is essential to meaning in almost every language.
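The order-blindness of plain attention can be checked numerically. A minimal sketch, assuming random vectors as stand-ins for the three word embeddings and attention without learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Plain self-attention with no positional information."""
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))  # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                  # reorder to "man", "bites", "dog"

out = self_attention(tokens)
out_perm = self_attention(tokens[perm])

# Each token gets the same vector in both orderings; only the row order differs
print(np.allclose(out[perm], out_perm))
```

Since the per-token outputs are identical, nothing downstream could tell the two sentences apart unless position is injected somewhere.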
Positional Strategies
| Strategy | How It Works | Used By |
|---|---|---|
| Sinusoidal | Fixed sine/cosine functions at different frequencies per dimension | Original transformer |
| Learned embeddings | Position-indexed vectors trained alongside model parameters | GPT-2, BERT |
| Rotary (RoPE) | Rotates Q and K vectors based on position, encoding relative distance | LLaMA, GPT-NeoX, most modern LLMs |
| ALiBi | Adds a linear bias to attention scores based on distance | BLOOM, MPT |
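As one concrete row from the table, the ALiBi bias can be sketched in a few lines. This assumes the geometric slope schedule commonly described for head counts that are powers of two; the function name `alibi_bias` is hypothetical, and real implementations combine this bias with a causal mask.

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head linear distance penalty, shape [n_heads, seq_len, seq_len].

    Slopes follow the geometric sequence 2^(-8/n), 2^(-16/n), ...
    (assumed schedule for power-of-two head counts).
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]  # query index minus key index
    distance = np.maximum(distance, 0)      # future keys are left to the causal mask
    return -slopes[:, None, None] * distance

bias = alibi_bias(seq_len=5, n_heads=8)
print("head 0 bias for the last query:", bias[0, -1])
```

The bias is added to the attention scores before softmax, so more distant keys start with a larger handicap, and each head applies a different slope.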
Why RoPE Dominates Modern LLMs
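Rotary position embeddings (Su et al., 2021) encode relative position by rotating the query and key vectors by position-dependent angles. Because both are rotated the same way, the Q-K dot product depends only on the offset between the two positions, and with the standard frequency schedule the score tends to decay as that offset grows. This:
- Tends to handle unseen sequence lengths better than learned absolute embeddings, although reliable length extrapolation usually still needs scaling extensions.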
- Encodes relative rather than absolute position, which is more linguistically natural.
- Integrates elegantly into the existing Q-K dot product without adding parameters.
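The relative-position property can be verified with a small sketch. The `rope` helper below is an illustrative single-vector implementation that rotates consecutive dimension pairs; production implementations differ in layout and batching, and the base of 10000 is the commonly used default, assumed here.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3), different absolute positions
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))
```

No parameters are added: the rotation is a fixed function of position, applied to Q and K just before the dot product.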
Python Example
import numpy as np
# Sinusoidal positional encoding (original transformer)
def sinusoidal_pe(max_len, d_model):
    """Generate sinusoidal position encodings."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, np.newaxis]
    div = np.exp(np.arange(0, d_model, 2) * -np.log(10000.0) / d_model)
    pe[:, 0::2] = np.sin(pos * div)  # even dimensions: sine
    pe[:, 1::2] = np.cos(pos * div)  # odd dimensions: cosine
    return pe
pe = sinusoidal_pe(max_len=10, d_model=8)
print("Position 0:", pe[0].round(3))
print("Position 5:", pe[5].round(3))
# Adding PE to token embeddings
token_embs = np.random.randn(10, 8) # 10 tokens, 8-dim
positioned = token_embs + pe # simply add position signal
Can transformers generalize to longer sequences than they were trained on?
Why don't vision transformers always need positional encodings?
How architecture choices map to real engineering constraints — model variants, stabilization machinery, scaling costs, masking strategies, and failure modes.
Encoder-Only, Decoder-Only, and Encoder-Decoder
Architecture Determines Information Flow
The attention mask pattern — which tokens can see which — fundamentally changes what the model is good at. See Topic 9: Causal vs Bidirectional for details on masking.
Comparison
| Property | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (left-to-right) | Bidirectional (enc) + Causal (dec) |
| Strengths | Understanding, classification | Generation, in-context learning | Sequence-to-sequence tasks |
| Weaknesses | Cannot generate fluently | Cannot use future context | More complex, two stacks to train and serve |
| Training | Masked LM (MLM) | Next-token prediction | Span corruption / denoising |
Why Decoder-Only Won
Despite encoder-decoder models being the original transformer design, decoder-only models have become dominant for LLMs because:
- Simplicity: One model, one architecture, one training objective (next-token prediction).
- Emergent capabilities: In-context learning, chain-of-thought reasoning, and instruction following all emerge naturally from autoregressive training at scale.
- Unification: A sufficiently large decoder-only model can handle classification, translation, and generation through prompting rather than architectural specialization.
Python Example
# Choosing the right architecture for the task
# Encoder-only: best for embeddings and classification
from transformers import AutoModel
encoder = AutoModel.from_pretrained("bert-base-uncased")
# Use for: semantic search, NER, sentiment analysis
# Decoder-only: best for generation
from transformers import AutoModelForCausalLM
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
# Use for: text completion, chat, code generation
# Encoder-decoder: best for structured transformation
from transformers import T5ForConditionalGeneration
enc_dec = T5ForConditionalGeneration.from_pretrained("t5-base")
# Use for: translation, summarization, structured output
Can a decoder-only model do classification well?
Why is BERT still used despite being older than GPT-style models?
What is prefix LM and how does it relate?
Feed-Forward Blocks, Residual Paths, and Layer Normalization
Feed-Forward Network (FFN)
The FFN applies two linear transformations with a nonlinearity (typically GELU or SiLU) between them: FFN(x) = W2 * activation(W1 * x + b1) + b2. The inner dimension is typically 4x the model dimension (e.g., 3072 for d_model=768).
Crucially, the FFN operates independently on each token position. While attention mixes information across tokens, the FFN processes each token's representation in isolation. This is where the model applies nonlinear transformations to the attended information.
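The position-wise property is easy to confirm: running the FFN on all tokens at once gives the same result as running it on each token separately. A minimal sketch, assuming the tanh approximation of GELU, random stand-in weights, and zero biases:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 * activation(W1 * x + b1) + b2, written for row-vector tokens."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # inner dimension is 4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # 4 tokens
all_at_once = ffn(x, W1, b1, W2, b2)
one_by_one = np.stack([ffn(tok, W1, b1, W2, b2) for tok in x])

# Position-wise: processing tokens together or separately gives the same result
print(np.allclose(all_at_once, one_by_one))
```

This independence is also why the FFN parallelizes trivially across sequence positions.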
Residual Connections
Each sub-layer (attention and FFN) wraps its output with a residual connection: output = sublayer(x) + x. This means each block learns a refinement rather than a full replacement, which:
- Preserves gradient flow through dozens or hundreds of layers.
- Allows information to pass through unchanged if a particular layer is not useful.
- Enables much deeper networks than would otherwise be trainable.
Layer Normalization
LayerNorm normalizes activations across the feature dimension for each token, keeping values in a stable range. Modern LLMs use Pre-Norm (normalize before the sub-layer) rather than Post-Norm (after), which improves training stability for very deep models.
Some recent models use RMSNorm instead of LayerNorm, which skips the mean-centering step for slight efficiency gains.
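The difference between the two normalizations is small enough to show side by side. This sketch omits the learned gain (and, for LayerNorm, bias) parameters that real implementations include.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print("LayerNorm:", layer_norm(x).round(3))  # zero mean, unit variance
print("RMSNorm:  ", rms_norm(x).round(3))    # unit RMS, mean not removed
```

RMSNorm needs one reduction over the feature dimension instead of two, which is where the efficiency gain comes from.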
Python Example
import numpy as np
def layer_norm(x, eps=1e-5):
    """Normalize across feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, w1, w2):
    """Position-wise feed-forward (ReLU stands in for GELU to keep it short)."""
    hidden = np.maximum(0, x @ w1)  # simplified ReLU
    return hidden @ w2
# Simulated transformer block
x = np.random.randn(4, 8) # 4 tokens, 8-dim
# Pre-norm attention + residual
normed = layer_norm(x)
attn_out = normed # simplified: pretend attention ran
x = x + attn_out # residual connection
# Pre-norm FFN + residual
normed = layer_norm(x)
w1 = np.random.randn(8, 32) # expand 4x
w2 = np.random.randn(32, 8) # contract back
x = x + ffn(normed, w1, w2) # residual connection
Why is the FFN inner dimension 4x the model dimension?
What is the difference between Pre-Norm and Post-Norm?
Why Transformers Scale Well but Become Expensive on Long Sequences
The Quadratic Wall
For a sequence of length n, standard self-attention requires O(n²) pairwise comparisons per head per layer. This means:
| Sequence Length | Attention Operations | Relative Cost |
|---|---|---|
| 2K tokens | 4 million | 1x |
| 8K tokens | 64 million | 16x |
| 32K tokens | 1 billion | 256x |
| 128K tokens | 16 billion | 4,096x |
Engineering Mitigations
The quadratic cost has driven major engineering effort:
- KV caching: During autoregressive generation, store previously computed key-value pairs to avoid recomputation. This trades memory for compute.
- Flash Attention: Fuses attention operations to reduce memory I/O, achieving 2-4x speedups without changing the math.
- Sparse attention: Patterns like sliding window (Longformer) or block-sparse attention reduce per-layer cost from O(n²) to O(n log n) or O(n).
- Grouped-Query Attention: Shares K/V heads across query head groups, reducing KV cache size by 4-8x.
- Context compression: Summarize or chunk long inputs to reduce effective sequence length.
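Of the mitigations above, KV caching is the simplest to demonstrate. The sketch below, using random Q, K, and V as stand-ins for projected activations, checks that decoding one token at a time against a growing cache reproduces full causal attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 6, 4  # hypothetical toy sizes
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))

# Reference: full causal attention computed in one shot
scores = Q @ K.T / np.sqrt(d)
causal = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
full = softmax(np.where(causal, scores, -np.inf)) @ V

# Incremental decoding: append each new key/value to a cache and reuse it
k_cache, v_cache, steps = [], [], []
for t in range(n_tokens):
    k_cache.append(K[t])
    v_cache.append(V[t])
    K_t, V_t = np.stack(k_cache), np.stack(v_cache)
    w = softmax(Q[t] @ K_t.T / np.sqrt(d))  # new query vs. all cached keys
    steps.append(w @ V_t)

print(np.allclose(full, np.stack(steps)))
```

Each step computes only one new row of scores, so per-token work grows linearly with context length, at the price of keeping every past key and value in memory.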
Systems-Level Impact
In interviews, connect architecture to systems: the same design that made transformers dominant also created strong incentives for context optimization, batching strategies, and KV caching. Long context is an engineering problem, not just a model capability checkbox.
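A back-of-the-envelope KV cache estimate is often the fastest way to make this point. The sketch assumes a hypothetical 32-layer model with 32 query heads of size 128 and an fp16 cache; the function name and configuration are illustrative, not taken from any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per_value=2, batch=1):
    """Bytes needed to cache K and V for every layer (leading 2 = one K and one V)."""
    return 2 * batch * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

# Hypothetical model: 32 layers, head size 128, fp16 cache, 32K context
mha = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=32, d_head=128)
gqa = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=8, d_head=128)

print(f"MHA (32 KV heads): {mha / 2**30:.1f} GiB")
print(f"GQA (8 KV heads):  {gqa / 2**30:.1f} GiB ({mha // gqa}x smaller)")
```

The cache grows linearly with sequence length and batch size, so it, rather than the weights, often sets the limit on concurrent long-context requests.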
Python Example
# Demonstrate quadratic scaling of attention
def attention_cost(seq_len, d_model=4096, n_heads=32, n_layers=32):
    """Rough multiply-accumulate count for QKV projections and attention scores.

    Omits the weights @ V product and the output projection for brevity.
    """
    d_head = d_model // n_heads
    # QKV projections: O(n * d^2) per layer
    qkv_flops = 3 * seq_len * d_model * d_model
    # Attention scores: O(n^2 * d) per head per layer
    attn_flops = n_heads * seq_len * seq_len * d_head
    # Total per layer, times layers
    total = (qkv_flops + attn_flops) * n_layers
    return total

for length in [2048, 8192, 32768, 131072]:
    flops = attention_cost(length)
    print(f"{length:>7} tokens: {flops/1e12:.1f} TFLOPs")
What is Flash Attention and why does it help?
How large is a KV cache for a 70B model at 128K context?
Will sub-quadratic attention replacements make transformers obsolete?
Causal Masking vs Bidirectional Attention
How Masking Works
The attention mask is applied before softmax. Masked positions are set to negative infinity, causing softmax to assign them zero weight. This simple mechanism fundamentally controls what information each token can access.
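The mechanism can be checked directly: set masked scores to negative infinity, apply softmax, and confirm the masked weights are exactly zero. The sketch uses a lower-triangular pattern purely as an example mask.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))

allowed = np.tril(np.ones((n, n), dtype=bool))      # example mask pattern
masked_scores = np.where(allowed, scores, -np.inf)  # masked positions -> -inf
weights = softmax(masked_scores)

print(weights.round(3))
```

Each row still sums to 1, so the probability mass that would have gone to masked positions is redistributed over the visible ones.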
Causal (Autoregressive) Masking
In a causal mask, token i can only attend to tokens 0 through i. This is required for next-token prediction because the model must not see future tokens during training — otherwise, it could simply copy the answer. All GPT-style and Claude-style models use causal masking.
Bidirectional Attention
In bidirectional attention, every token attends to every other token (no masking). This is used by encoder models like BERT for understanding tasks: classification, retrieval, and NER. Because the model sees full context, it produces richer representations, but it cannot generate text autoregressively.
Architecture and Objective Are Coupled
The deeper point is that changing the attention mask changes what the model is allowed to know. This is why the architecture variants in Topic 6 are not just structural categories: they represent fundamentally different information flows that suit different tasks.
Python Example
import numpy as np
n = 5 # sequence length
# Causal mask: lower triangle (including diagonal)
causal_mask = np.tril(np.ones((n, n)))
print("Causal mask:")
print(causal_mask)
# [[1, 0, 0, 0, 0], token 0 sees only itself
# [1, 1, 0, 0, 0], token 1 sees tokens 0-1
# [1, 1, 1, 0, 0], ...
# [1, 1, 1, 1, 0],
# [1, 1, 1, 1, 1]] token 4 sees all
# Bidirectional: all ones (no masking)
bidir_mask = np.ones((n, n))
print("Bidirectional mask:")
print(bidir_mask)
# Apply mask to attention scores
scores = np.random.randn(n, n)
masked = scores + (1 - causal_mask) * (-1e9) # -inf for masked
Can you mix causal and bidirectional attention in one model?
Does the causal mask hurt generation quality since tokens cannot see future context?
Common Transformer Failure Modes
Attention Diffusion
As context length increases, attention weights spread across more tokens, diluting the model's ability to focus on the most relevant information. A related effect is the "lost in the middle" phenomenon, where information placed in the middle of a long context is used less reliably than information at the beginning or end.
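The dilution effect follows directly from softmax normalization. In this toy setup (an assumption for illustration, not a measurement of any model), one relevant token scores a fixed margin above every distractor, and the only thing that changes is the number of distractors.

```python
import numpy as np

def weight_on_relevant(n_tokens, margin=5.0):
    """Softmax weight on one token whose score exceeds n-1 distractors by `margin`."""
    scores = np.zeros(n_tokens)
    scores[0] = margin
    e = np.exp(scores - scores.max())
    return (e / e.sum())[0]

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: weight on relevant token = {weight_on_relevant(n):.4f}")
```

With a fixed score margin, the relevant token's share shrinks roughly in proportion to 1/n, so the model must produce ever larger score gaps to keep focusing as context grows.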
Positional Degradation
Models trained on sequences up to length N often degrade on inputs longer than N. Even with RoPE (Topic 5) and its extensions, there is no guarantee that extrapolation will preserve quality. Testing at your target context length is essential.
Hallucination and Context Dilution
Hallucination is not purely random: it becomes more likely when the model is asked to produce information that is not supported by its training data or retrieval context. Context dilution happens when too much irrelevant information in the prompt causes the model to attend to noise rather than signal. Both are managed through system design:
- Retrieval quality: Better retrieval reduces hallucination.
- Prompt engineering: Shorter, more focused prompts reduce context dilution.
- Grounding: Citation requirements and tool use force the model to reference sources.
Decoding Configuration
| Setting | Too Low | Too High | Recommended |
|---|---|---|---|
| Temperature | Repetitive, dull | Incoherent, random | 0.0-0.3 (factual), 0.7-1.0 (creative) |
| Top-p | Too deterministic | Low-probability tokens appear | 0.9-0.95 for most tasks |
| Top-k | Limits diversity | Allows noise | 40-100, or disable in favor of top-p |
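The three settings in the table can be sketched as successive filters on the next-token distribution. The function name `next_token_distribution` and the toy logits are hypothetical; inference libraries implement the same ideas with their own ordering and edge-case rules.

```python
import numpy as np

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the filtered, renormalized next-token probabilities."""
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(-probs)  # token ids, most likely first
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False  # drop everything past the k most likely
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # keep the smallest prefix whose cumulative mass reaches top_p
        cutoff = np.searchsorted(cumulative, top_p) + 1
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.5, -1.0]
print("T=1.0    :", next_token_distribution(logits).round(3))
print("T=0.3    :", next_token_distribution(logits, temperature=0.3).round(3))
print("top_k=2  :", next_token_distribution(logits, top_k=2).round(3))
print("top_p=0.9:", next_token_distribution(logits, top_p=0.9).round(3))

rng = np.random.default_rng(0)
next_token = rng.choice(len(logits), p=next_token_distribution(logits, temperature=0.7, top_p=0.9))
```

Lower temperature sharpens the distribution before any filtering, while top-k and top-p remove the low-probability tail and renormalize what remains.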
Python Example
# Demonstrate "lost in the middle" attention pattern
import numpy as np
def simulated_attention(seq_len, query_pos=-1):
    """Simulate attention pattern showing primacy/recency bias.

    query_pos is unused in this toy model.
    """
    positions = np.arange(seq_len)
    # Attention tends toward start (primacy) and end (recency)
    primacy = np.exp(-positions / 20.0)
    recency = np.exp(-(seq_len - positions) / 20.0)
    baseline = 0.3  # uniform floor; middle positions get little beyond this
    weights = primacy + recency + baseline
    weights = weights / weights.sum()  # normalize
    return weights

weights = simulated_attention(100)
print("Start weight:", weights[:5].sum().round(3))
print("Middle weight:", weights[40:60].sum().round(3))
print("End weight:", weights[-5:].sum().round(3))