Ch 4: Transformer Architecture, Attention & Positional Reasoning

Architecture & Mechanics

How the transformer works, from the high-level breakthrough to the mechanics of attention, QKV, multi-head processing, and positional information.

Why the Transformer Was a Major Breakthrough

The transformer replaced recurrence with attention, allowing each token to directly weigh other relevant tokens in the sequence while enabling far more parallel computation. The breakthrough was both algorithmic and operational.

💡 RNNs read a book one word at a time, remembering as they go. Transformers see the whole page at once and decide what to focus on.

Timeline: RNNs/LSTMs (sequential, slow) → Attention (2014, added to RNNs) → Transformer (2017, "Attention is all you need") → Modern LLMs (GPT, Claude, etc.)

The shift from sequential to parallel processing enabled the scale of today's language models.

What Changed

Before transformers, sequence models like RNNs and LSTMs processed tokens step by step. Token 50 had to wait for tokens 1-49 to be processed first. This created two problems:

Training was slow — sequential processing could not fully exploit modern GPU parallelism.
Long-range dependencies were hard — information had to survive being passed through many sequential steps, leading to vanishing gradients and forgetfulness.

The Transformer Solution

The transformer (Vaswani et al., 2017) replaced recurrence entirely with self-attention (Topic 2), allowing every token to directly attend to every other token in the sequence. This meant:

Parallelism: All tokens can be processed simultaneously during training, mapping efficiently to GPU hardware.
Direct connections: Token 50 can directly "look at" token 1 without information passing through 49 intermediate steps.
Scalability: The architecture scales to billions of parameters and billions of training tokens.

Interview Signal

Do not answer only "because of attention." The strongest answer is that the architecture scaled better, trained faster on modern accelerators, and generalized into the foundation of current LLMs. The breakthrough was both algorithmic (attention) and operational (parallelism).

→ The transformer's power comes from replacing sequential processing with parallel attention — improving both modeling quality and hardware utilization simultaneously.

Python Example

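# Conceptual comparison: RNN vs Transformer processing
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def update(hidden_state, token):
    """Toy recurrent update; a stand-in for a learned RNN cell."""
    return np.tanh(hidden_state + token)

# RNN: sequential, each step depends on the previous one
def rnn_process(tokens, hidden_state):
    """Process tokens one at a time, sequentially."""
    outputs = []
    for token in tokens:
        hidden_state = update(hidden_state, token)  # can't parallelize across steps
        outputs.append(hidden_state)
    return outputs  # O(n) sequential steps

# Transformer: parallel, all tokens processed at once
def transformer_process(tokens):
    """Process all tokens simultaneously via attention."""
    # All-to-all attention computed in one matrix operation
    scores = tokens @ tokens.T  # every token attends to every other
    weights = softmax(scores)
    output = weights @ tokens   # parallel, GPU-friendly
    return output  # O(1) sequential steps per layer (O(n^2) compute)

tokens = np.random.randn(5, 4)  # 5 tokens, 4-dim embeddings
print(len(rnn_process(tokens, np.zeros(4))), transformer_process(tokens).shape)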

Follow-up Questions

Are RNNs completely obsolete?

Not entirely. Recent architectures like Mamba (a selective state-space model) and RWKV (an RNN with linear-attention-style updates) revisit recurrence, achieving competitive quality with lower memory costs on very long sequences. However, the transformer remains the dominant architecture for large-scale language models.

What did "Attention Is All You Need" actually propose?

The 2017 paper proposed an encoder-decoder architecture built entirely from self-attention and feed-forward layers, with no recurrence or convolution. It was initially demonstrated on machine translation, but the architecture proved far more general than the original task.

How did the transformer influence non-NLP fields?

Transformers have been adopted for computer vision (Vision Transformer / ViT), audio processing (Whisper), protein folding (AlphaFold), and reinforcement learning (Decision Transformer). The self-attention mechanism generalizes to any data that can be represented as a sequence.

What Is Self-Attention?

Self-attention lets each token look at every other token in the sequence and decide which ones matter most for building its representation. A token representing "bank" can attend to "river" or "loan" nearby and shift its meaning accordingly.

💡 Every token asks: "Which other tokens should I consult before I update my understanding?" Attention is the mechanism that answers that question.

How It Works

For each token, self-attention computes a weighted sum of the representations of all tokens in the sequence (including the token itself). The weights are determined by how relevant each token is to the current one. This is computed via the Query-Key-Value mechanism (Topic 3).

The result is that each token's representation becomes context-sensitive. The same word "bank" produces a different internal representation depending on whether it appears near "river" or "loan."

Why Self-Attention Is Powerful

Ambiguity resolution: The word "bank" gets disambiguated by attending to context tokens.
Co-reference: A pronoun "she" can attend to the name it refers to, even if several sentences earlier.
Long-distance dependencies: A negation ("not") can influence a word many positions later through direct attention.

The Cost

Standard self-attention compares every token to every other token, which means computation and memory grow quadratically with sequence length. A 4,096-token sequence requires 4,096 × 4,096 ≈ 16.8 million pairwise comparisons per attention head. See Topic 8: Scaling & Long Sequences for how this shapes system design.

→ Self-attention builds context-sensitive meaning by letting each token dynamically weight every other token. Clarity beats jargon in interviews: explain the mechanism, then describe the engineering consequence.

Python Example

import numpy as np

# Simplified self-attention for 4 tokens, 3-dim embeddings
tokens = np.array([
    [1.0, 0.0, 0.5],  # "The"
    [0.0, 1.0, 0.8],  # "bank"
    [0.5, 0.3, 1.0],  # "near"
    [0.2, 0.9, 0.1],  # "river"
])

# Compute raw attention scores (dot product of all pairs)
scores = tokens @ tokens.T  # shape: [4, 4]

# Apply softmax to get attention weights (rows sum to 1)
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)

# Weighted sum: each token's new representation
output = weights @ tokens
print("Attention weights for 'bank':", weights[1].round(3))

Follow-up Questions

Is self-attention the same as cross-attention?

Self-attention attends within the same sequence (query, key, and value all come from the same input). Cross-attention attends from one sequence to another (e.g., a decoder attending to encoder outputs in translation). The mechanism is the same; the source of Q, K, V differs.
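The distinction can be sketched with one attention function that takes the query source and the key/value source as separate arguments. This is a minimal illustration with random weights; the names `attention`, `encoder_out`, and `decoder_in` are hypothetical and not taken from any library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_source, kv_source, w_q, w_k, w_v):
    """Scaled dot-product attention; Q and K/V may come from different sequences."""
    q = q_source @ w_q
    k = kv_source @ w_k
    v = kv_source @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
encoder_out = rng.normal(size=(6, d))  # 6 source tokens (assumed)
decoder_in = rng.normal(size=(3, d))   # 3 target tokens (assumed)

# Self-attention: Q, K, V all come from the same sequence
self_attn = attention(decoder_in, decoder_in, w_q, w_k, w_v)
# Cross-attention: Q from the decoder, K and V from the encoder
cross_attn = attention(decoder_in, encoder_out, w_q, w_k, w_v)
print(self_attn.shape, cross_attn.shape)
```

In both cases the output has one row per query token; only the set of tokens being attended to changes.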

Can attention weights be used to interpret model decisions?

With caution. Attention weights show which tokens the model focused on, but they do not provide a complete causal explanation. High attention does not necessarily mean the token was the reason for the output. Attention is a useful diagnostic signal, not a definitive interpretation tool.

What is the softmax temperature in attention?

The scaling factor 1/sqrt(d_k) in scaled dot-product attention acts like a temperature. Without it, dot products grow large in high dimensions, causing softmax to produce extremely peaked distributions that are hard to train. The scale keeps gradients flowing.

Query, Key, and Value Vectors

Queries represent what a token is looking for. Keys represent what each token offers as a signal. Values represent the content that gets mixed once relevance is determined. Query-key similarity decides who matters; values decide what information gets copied forward.

💡 Think of a library: the Query is your search request, Keys are the index cards, and Values are the actual book contents. You search the index to decide which books to read.

Flow: Token Embedding x → Query = W_Q × x, Key = W_K × x, Value = W_V × x → Attention Scores (Query · K^T) → softmax → weighted sum of Values → Output

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

The Matching Step

Each token's embedding is linearly projected into three separate vectors through learned weight matrices. The query (Q) represents what information the token is searching for. The key (K) represents what the token advertises as its identity or content. The dot product Q · K^T computes a relevance score between every pair of tokens.

The Content Aggregation Step

After softmax normalizes the scores into a probability distribution, the resulting attention weights are used to compute a weighted sum of the value (V) vectors. This weighted sum becomes the token's new, context-enriched representation. The key insight: Q-K matching decides who matters, and V decides what information gets forwarded.

Scaled Dot-Product

The scores are divided by √d_k (the square root of the key dimension) before softmax. Without this scaling, dot products in high dimensions produce very large values, causing softmax to output near-one-hot distributions with vanishing gradients.
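The effect of the scaling can be seen in a small experiment. The sketch below assumes random Gaussian queries and keys with d_k = 512 (an illustrative choice, not a value from the text) and compares the softmax output with and without the 1/√d_k factor.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(d_k,))      # one query vector
K = rng.normal(size=(16, d_k))   # 16 key vectors

raw = K @ q                      # unscaled scores; spread grows with sqrt(d_k)
scaled = raw / np.sqrt(d_k)      # scaled scores; spread stays near 1

w_raw = softmax(raw)
w_scaled = softmax(scaled)
print("std of raw scores:   ", raw.std().round(1))
print("std of scaled scores:", scaled.std().round(1))
print("max weight, unscaled:", w_raw.max().round(3))
print("max weight, scaled:  ", w_scaled.max().round(3))
```

The unscaled weights concentrate on far fewer keys than the scaled ones, which is the near-one-hot behavior described above.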

Interview tip: Interviewers like this question because it reveals whether you truly understand attention or only memorize vocabulary. Explain both the matching step (Q · K) and the content aggregation step (weighted V).

→ Q-K similarity determines attention weights (who to listen to); V provides the content (what to take away). The separation of matching from content is what makes attention so flexible.

Python Example

import numpy as np

d_k = 4  # key dimension

# Simulated Q, K, V for 3 tokens
Q = np.array([[1,0,1,0], [0,1,0,1], [1,1,0,0]], dtype=float)
K = np.array([[1,1,0,0], [0,0,1,1], [1,0,0,1]], dtype=float)
V = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]], dtype=float)

# Step 1: Compute scaled dot-product scores
scores = Q @ K.T / np.sqrt(d_k)  # scale by sqrt(d_k)

# Step 2: Softmax to get attention weights
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)

# Step 3: Weighted sum of values
output = weights @ V
print("Attention output:", output)

Follow-up Questions

Why are Q, K, V separate projections instead of using the raw embeddings?

Separate learned projections give the model flexibility to learn different representations for "what I'm looking for" (Q) vs. "what I offer" (K) vs. "what I contain" (V). Using raw embeddings would force a single representation to serve all three roles, limiting the model's expressiveness.
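One concrete consequence can be sketched in a few lines: scores computed from raw embeddings are always symmetric (token i scores token j exactly as j scores i), while separate Q and K projections allow asymmetric relevance. The random matrices `W_Q` and `W_K` below are stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4

x = rng.normal(size=(n_tokens, d_model))   # token embeddings
W_Q = rng.normal(size=(d_model, d_k))      # stand-in for a learned projection
W_K = rng.normal(size=(d_model, d_k))      # stand-in for a learned projection

Q, K = x @ W_Q, x @ W_K
learned_scores = Q @ K.T / np.sqrt(d_k)
raw_scores = x @ x.T / np.sqrt(d_model)    # no projections at all

# Raw-embedding scores are symmetric by construction; projected scores need not be
print("raw symmetric:    ", np.allclose(raw_scores, raw_scores.T))
print("learned symmetric:", np.allclose(learned_scores, learned_scores.T))
```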

What is the computational cost of the QKV computation?

For a sequence of length n with model dimension d, computing Q, K, V requires O(n · d²) for the projections plus O(n² · d) for the attention matrix multiplication. The n² term is what makes long sequences expensive.

Why Transformers Use Multiple Attention Heads

Multiple heads allow the model to learn several types of relationships in parallel. One head may focus on syntax, another on entity references, another on discourse structure. Each head gets its own projection space for specialization.

💡 Multi-head attention is like having multiple analysts reading the same report, each looking for different things — one tracks names, another tracks numbers, another tracks sentiment.

Head 1: Local syntax & grammar
Head 2: Entity references
Head 3: Positional patterns
Head 4: Discourse / topic

Concat + Linear Projection → Output

Each head operates in its own subspace. Outputs are concatenated and projected.

Specialization Through Subspaces

Instead of computing one large attention operation, multi-head attention splits the model dimension into h smaller subspaces (heads). Each head independently computes Q, K, V attention (Topic 3) in its own learned projection space, allowing different heads to capture different types of relationships.

Empirically, researchers have observed that different heads specialize in different linguistic phenomena:

Syntax heads: Attend to grammatically related tokens (subject-verb, modifier-noun).
Positional heads: Attend to fixed relative positions (previous token, next token).
Semantic heads: Attend to thematically related content across long distances.
Copy heads: Attend to tokens that should be repeated in the output.

How Heads Combine

After each head produces its output, all head outputs are concatenated and passed through a linear projection. This allows the model to learn how to combine the different types of attention into a single, rich representation.
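The split, per-head attention, concatenation, and output projection can be sketched together as follows. Random matrices stand in for learned weights, and `multi_head_attention` is a hypothetical helper name; real implementations also add masking, batching, and biases.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, n, n)
    heads = softmax(scores) @ v                             # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return concat @ w_o                                     # final linear projection

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
x = rng.normal(size=(5, d_model))  # 5 tokens
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)
```

The output has the same shape as the input, which is what allows layers to be stacked.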

Practical Considerations

Model	Heads	d_model	d_head
GPT-2 Small	12	768	64
GPT-3 175B	96	12,288	128
LLaMA 70B	64	8,192	128

More heads increase flexibility, but each individual head has a smaller subspace. The total computation remains similar because the split is along the model dimension.

→ Multi-head attention increases representational richness by letting different heads specialize in different relationship types. Avoid saying "more heads is always better" — value depends on model size, task, and training quality.

Python Example

import numpy as np

# Simplified multi-head attention (2 heads, d_model=4)
d_model = 4
n_heads = 2
d_head = d_model // n_heads  # each head gets 2 dims

x = np.random.randn(3, d_model)  # 3 tokens

# Each head processes a slice of the dimension
head_outputs = []
for h in range(n_heads):
    start = h * d_head
    end = start + d_head
    x_h = x[:, start:end]  # slice for this head

    # Simplified attention within this head's subspace
    scores = x_h @ x_h.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    out_h = weights @ x_h
    head_outputs.append(out_h)
    print(f"Head {h} output shape:", out_h.shape)

# Concatenate all head outputs
concat = np.concatenate(head_outputs, axis=-1)
print("Concatenated shape:", concat.shape)  # back to d_model

Follow-up Questions

Can you prune attention heads without hurting quality?

Yes, research shows that many heads can be pruned with minimal quality loss. Head pruning studies found that some heads are redundant or capture overlapping patterns. This has practical implications for inference efficiency — fewer heads means less computation per layer.

What is Grouped-Query Attention (GQA)?

GQA shares key-value heads across groups of query heads, reducing KV cache memory by the ratio of query heads to KV heads (commonly 4-8x) with minimal quality loss. LLaMA 2 70B, for example, uses 8 KV heads for 64 query heads, and many modern models use GQA to serve longer contexts more efficiently. It sits between standard multi-head attention and multi-query attention in the efficiency-quality trade-off.
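A rough sketch of the sharing pattern, assuming 8 query heads and 2 KV heads (illustrative sizes only): each KV head is reused by a group of consecutive query heads, so only the smaller K and V tensors need to be cached.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_head = 4, 8
n_q_heads, n_kv_heads = 8, 2          # assumed sizes for illustration
group = n_q_heads // n_kv_heads       # query heads per KV head

q = rng.normal(size=(n_q_heads, n, d_head))
k = rng.normal(size=(n_kv_heads, n, d_head))   # this is all that gets cached
v = rng.normal(size=(n_kv_heads, n, d_head))

# Expand each KV head so it lines up with its group of query heads
k_shared = np.repeat(k, group, axis=0)          # (n_q_heads, n, d_head)
v_shared = np.repeat(v, group, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(d_head)
out = softmax(scores) @ v_shared                # (n_q_heads, n, d_head)

cache_ratio = n_q_heads / n_kv_heads
print(out.shape, "KV cache reduction:", cache_ratio)
```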

Why Transformers Need Positional Encodings

Attention alone is order-blind (strictly, permutation-equivariant): it knows which tokens are present but not their order. Positional encodings inject sequence order so the model can distinguish "dog bites man" from "man bites dog."

💡 Without position signals, the transformer sees a bag of Scrabble tiles. Positional encoding arranges them on the board.

Each token gets a position signal added to its embedding.

Sinusoidal functions encode position as unique frequency patterns. Used in the original transformer.

The Permutation Problem

Self-attention (Topic 2) is blind to token order if no position information is provided: permuting the input tokens merely permutes the outputs in the same way, so each token ends up with the same representation wherever it sits. The set {dog, bites, man} produces the same pairwise attention scores as {man, bites, dog}. This is a fundamental problem because word order is essential to meaning in almost every language.
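This order-blindness can be checked directly. The sketch below uses projection-free self-attention on random vectors standing in for the three words; reordering the inputs only reorders the outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Self-attention without projections or position information."""
    return softmax(x @ x.T) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))   # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]              # reorder to "man", "bites", "dog"

out = self_attention(x)
out_perm = self_attention(x[perm])

# Each token gets exactly the same vector in either ordering
print(np.allclose(out[perm], out_perm))
```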

Positional Strategies

Strategy	How It Works	Used By
Sinusoidal	Fixed sine/cosine functions at different frequencies per dimension	Original transformer
Learned embeddings	Position-indexed vectors trained alongside model parameters	GPT-2, BERT
Rotary (RoPE)	Rotates Q and K vectors based on position, encoding relative distance	LLaMA, GPT-NeoX, most modern LLMs
ALiBi	Adds a linear bias to attention scores based on distance	BLOOM, MPT

Why RoPE Dominates Modern LLMs

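Rotary position embeddings (Su et al., 2021) encode relative position by rotating the query and key vectors by position-dependent angles, so the Q-K dot product depends only on the offset between the two positions. With the standard frequency schedule, the score also tends to decay as that offset grows. This:

Generalizes to unseen sequence lengths better than learned absolute embeddings, especially when combined with frequency-scaling methods (length extrapolation).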
Encodes relative rather than absolute position, which is more linguistically natural.
Integrates elegantly into the existing Q-K dot product without adding parameters.
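The relative-position property can be illustrated with a minimal rotary sketch. It assumes the common formulation in which consecutive feature pairs are rotated with frequencies 10000^(-2i/d); the function name `rope` is hypothetical, and production implementations differ in layout and batching.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3 positions), very different absolute positions
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))
```

No parameters are added: the rotation is applied to Q and K just before the usual dot product.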

→ Positional encoding is required because attention is order-blind by design. Most modern LLMs use RoPE for its relative-position properties and length generalization.

Python Example

import numpy as np

# Sinusoidal positional encoding (original transformer)
def sinusoidal_pe(max_len, d_model):
    """Generate sinusoidal position encodings."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, np.newaxis]
    div = np.exp(np.arange(0, d_model, 2) * -np.log(10000.0) / d_model)

    pe[:, 0::2] = np.sin(pos * div)  # even dimensions: sine
    pe[:, 1::2] = np.cos(pos * div)  # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(max_len=10, d_model=8)
print("Position 0:", pe[0].round(3))
print("Position 5:", pe[5].round(3))

# Adding PE to token embeddings
token_embs = np.random.randn(10, 8)  # 10 tokens, 8-dim
positioned = token_embs + pe  # simply add position signal

Follow-up Questions

Can transformers generalize to longer sequences than they were trained on?

It depends on the positional encoding. Learned absolute embeddings fail beyond the training length. RoPE and ALiBi degrade more gracefully but still lose quality. Techniques like YaRN and NTK-aware scaling extend RoPE to longer contexts by adjusting the rotation frequencies.

Why don't vision transformers always need positional encodings?

They usually do include them. The original Vision Transformer (ViT) adds learned position embeddings to the image patch embeddings (1D by default; its authors reported no significant gain from 2D-aware variants). However, some architectures show that with enough training data, models can partially infer spatial relationships from the data itself, making position embeddings less critical than in text.

Systems & Trade-offs

How architecture choices map to real engineering constraints — model variants, stabilization machinery, scaling costs, masking strategies, and failure modes.

Encoder-Only, Decoder-Only, and Encoder-Decoder

Encoder-only models use bidirectional attention for understanding tasks. Decoder-only models generate text autoregressively with causal masking. Encoder-decoder models separate input encoding from output generation. Architecture determines what the model is best suited to do.

💡 Encoder-only is a reader (understands text). Decoder-only is a writer (generates text). Encoder-decoder is a translator (reads input, then writes output).

Encoder-Only (BERT, RoBERTa): Bidirectional attention. Classification, retrieval, NER.
Decoder-Only (GPT, Claude, LLaMA): Causal attention. Text generation, chat, code.
Encoder-Decoder (T5, BART, mBART): Separate encode + decode. Translation, summarization.

Architecture Determines Information Flow

The attention mask pattern — which tokens can see which — fundamentally changes what the model is good at. See Topic 9: Causal vs Bidirectional for details on masking.

Comparison

Property	Encoder-Only	Decoder-Only	Encoder-Decoder
Attention	Bidirectional	Causal (left-to-right)	Bidirectional (enc) + Causal (dec)
Strengths	Understanding, classification	Generation, in-context learning	Sequence-to-sequence tasks
Weaknesses	Cannot generate fluently	Cannot use future context	More complex, two models to serve
Training	Masked LM (MLM)	Next-token prediction	Span corruption / denoising

Why Decoder-Only Won

Despite encoder-decoder models being the original transformer design, decoder-only models have become dominant for LLMs because:

Simplicity: One model, one architecture, one training objective (next-token prediction).
Emergent capabilities: In-context learning, chain-of-thought reasoning, and instruction following all emerge naturally from autoregressive training at scale.
Unification: A sufficiently large decoder-only model can handle classification, translation, and generation through prompting rather than architectural specialization.

→ Architecture is not just taxonomy — it determines what the model is best suited to do. Decoder-only models dominate current LLMs, but encoder models remain the best choice for embedding and classification tasks.

Python Example

# Choosing the right architecture for the task

# Encoder-only: best for embeddings and classification
from transformers import AutoModel
encoder = AutoModel.from_pretrained("bert-base-uncased")
# Use for: semantic search, NER, sentiment analysis

# Decoder-only: best for generation
from transformers import AutoModelForCausalLM
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
# Use for: text completion, chat, code generation

# Encoder-decoder: best for structured transformation
from transformers import T5ForConditionalGeneration
enc_dec = T5ForConditionalGeneration.from_pretrained("t5-base")
# Use for: translation, summarization, structured output

Follow-up Questions

Can a decoder-only model do classification well?

Yes, through prompting. Large decoder-only models classify text by generating the label as text (e.g., "positive" or "negative"). For smaller models or high-throughput classification, encoder-only models remain more efficient because they produce a fixed-size representation without generation overhead.

Why is BERT still used despite being older than GPT-style models?

BERT-style encoders are faster and cheaper for tasks that do not require generation: semantic search, classification, named entity recognition, and sentence similarity. A fine-tuned BERT-base model with 110M parameters can be competitive with a 7B decoder on these tasks while having roughly 60x fewer parameters, and correspondingly lower compute cost per input.

What is prefix LM and how does it relate?

Prefix LM (used by UniLM and studied as a variant in the T5 paper) allows bidirectional attention over a prefix portion of the input, then switches to causal attention for generation. This combines some benefits of both encoder and decoder architectures within a single model, enabling better input understanding while retaining generation ability.

Feed-Forward Blocks, Residual Paths, and Layer Normalization

Attention mixes information across tokens; the feed-forward network transforms each token individually. Residual connections preserve gradient flow, and layer normalization keeps activations stable. The transformer is not just attention — it is attention plus stabilization machinery.

💡 Attention is a group discussion. The FFN is each person quietly processing what they heard. Residuals ensure no one forgets what they already knew. LayerNorm keeps the conversation at a reasonable volume.

One transformer layer (Pre-Norm): Input Tokens → LayerNorm → Multi-Head Attention + Residual → LayerNorm → Feed-Forward Network + Residual → Output Tokens

Feed-Forward Network (FFN)

The FFN applies two linear transformations with a nonlinearity (typically GELU or SiLU) between them: FFN(x) = W2 * activation(W1 * x + b1) + b2. The inner dimension is typically 4x the model dimension (e.g., 3072 for d_model=768).

Crucially, the FFN operates independently on each token position. While attention mixes information across tokens, the FFN processes each token's representation in isolation. This is where the model applies nonlinear transformations to the attended information.
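The per-position independence can be checked with a small sketch. It uses the tanh approximation of GELU and a row-vector convention (`x @ w1` rather than W1 * x); the sizes and random weights are assumptions for illustration.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w1, b1, w2, b2):
    """Two linear maps with a nonlinearity in between."""
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # 4x expansion
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(4, d_model))           # 4 tokens

batched = ffn(x, w1, b1, w2, b2)                                 # all tokens at once
one_by_one = np.stack([ffn(tok, w1, b1, w2, b2) for tok in x])   # each token alone

# Same result either way: no information moves between positions in the FFN
print(np.allclose(batched, one_by_one))
```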

Residual Connections

Each sub-layer (attention and FFN) wraps its output with a residual connection: output = sublayer(x) + x. This means each block learns a refinement rather than a full replacement, which:

Preserves gradient flow through dozens or hundreds of layers.
Allows information to pass through unchanged if a particular layer is not useful.
Enables much deeper networks than would otherwise be trainable.

Layer Normalization

LayerNorm normalizes activations across the feature dimension for each token, keeping values in a stable range. Modern LLMs use Pre-Norm (normalize before the sub-layer) rather than Post-Norm (after), which improves training stability for very deep models.

Some recent models use RMSNorm instead of LayerNorm, which skips the mean-centering step for slight efficiency gains.
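The difference between the two normalizations is small enough to show side by side. This sketch omits the learned gain and bias that real layers include, and the input values are arbitrary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    """Divide by the root-mean-square only; no mean-centering."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print("LayerNorm:", layer_norm(x).round(3))
print("RMSNorm:  ", rms_norm(x).round(3))
```

LayerNorm output is centered at zero, while RMSNorm only rescales, so an all-positive input stays all-positive.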

→ The transformer is not just attention. It is attention plus repeated stabilization (LayerNorm) and transformation (FFN) machinery, connected by residual paths that enable deep stacking.

Python Example

import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, w1, w2):
    """Position-wise feed-forward (ReLU used as a simple stand-in for GELU)."""
    hidden = np.maximum(0, x @ w1)  # ReLU; real models typically use GELU or SiLU
    return hidden @ w2

# Simulated transformer block
x = np.random.randn(4, 8)  # 4 tokens, 8-dim

# Pre-norm attention + residual
normed = layer_norm(x)
attn_out = normed  # simplified: pretend attention ran
x = x + attn_out   # residual connection

# Pre-norm FFN + residual
normed = layer_norm(x)
w1 = np.random.randn(8, 32)  # expand 4x
w2 = np.random.randn(32, 8)  # contract back
x = x + ffn(normed, w1, w2)  # residual connection

Follow-up Questions

Why is the FFN inner dimension 4x the model dimension?

The 4x expansion is a design choice from the original transformer paper that has been widely adopted. A larger inner dimension gives the FFN more capacity for nonlinear transformation. Some models use different ratios; Mixture-of-Experts models use even larger FFNs but only activate a subset per token.

What is the difference between Pre-Norm and Post-Norm?

Pre-Norm applies LayerNorm before the sub-layer, making gradients more stable at initialization. Post-Norm applies it after. Pre-Norm is strongly preferred for deep models (50+ layers) because Post-Norm can cause training instability, though Post-Norm may achieve slightly better final quality when training is stable.

Why Transformers Scale Well but Become Expensive on Long Sequences

Transformers scale well because attention can be computed in parallel. But standard self-attention compares every token pairwise, so computation and memory grow quadratically with sequence length. Long context is not free.

💡 Doubling the number of guests at a dinner party more than doubles the conversations that need to happen — every guest must talk to every other guest.

The Quadratic Wall

For a sequence of length n, standard self-attention requires O(n²) pairwise comparisons per head per layer. This means:

Sequence Length	Attention Operations	Relative Cost
2K tokens	4 million	1x
8K tokens	64 million	16x
32K tokens	1 billion	256x
128K tokens	16 billion	4,096x
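The table's figures follow from squaring the sequence length, taking "K" as 1,024 tokens. A short sketch reproduces the exact pair counts (which the table rounds) and the relative costs:

```python
def attention_pairs(n_tokens):
    """Pairwise comparisons per head per layer for full self-attention."""
    return n_tokens * n_tokens

base = attention_pairs(2048)
for n in [2048, 8192, 32768, 131072]:
    pairs = attention_pairs(n)
    print(f"{n:>7} tokens: {pairs:>15,} pairs, {pairs // base:>5,}x")
```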

Engineering Mitigations

The quadratic cost has driven major engineering effort:

KV caching: During autoregressive generation, store previously computed key-value pairs to avoid recomputation. This trades memory for compute.
Flash Attention: Fuses attention operations to reduce memory I/O, achieving 2-4x speedups without changing the math.
Sparse attention: Patterns like sliding window (Longformer) or block-sparse attention restrict each token to a subset of positions, reducing per-layer cost from O(n²) to roughly O(n · w) for a window of size w, or O(n log n) for some other patterns.
Grouped-Query Attention: Shares K/V heads across query head groups, typically reducing KV cache size by 4-8x.
Context compression: Summarize or chunk long inputs to reduce effective sequence length.
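Of these mitigations, KV caching is the easiest to sketch. The example below is a single-head toy with random Q, K, V and no projections: decoding one token at a time against a growing cache gives the same result as full causal attention over the whole sequence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

# Reference: full causal attention computed in one shot
mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(mask, q @ k.T / np.sqrt(d), -np.inf)
full = softmax(scores) @ v

# Incremental decoding: append one K/V row per step, attend with only the new query
k_cache, v_cache, steps = [], [], []
for t in range(n):
    k_cache.append(k[t])
    v_cache.append(v[t])
    K_t, V_t = np.stack(k_cache), np.stack(v_cache)
    w = softmax(q[t] @ K_t.T / np.sqrt(d))   # one row of scores, length t + 1
    steps.append(w @ V_t)
cached = np.stack(steps)

print(np.allclose(full, cached))
```

Each step does work proportional to the current length rather than recomputing all earlier rows, at the price of keeping every past key and value in memory.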

Systems-Level Impact

In interviews, connect architecture to systems: the same design that made transformers dominant also created strong incentives for context optimization, batching strategies, and KV caching. Long context is an engineering problem, not just a model capability checkbox.

→ Long context is not free. Engineers pay for it in latency, throughput, and memory pressure. The quadratic cost of attention is the central bottleneck of transformer serving.

Python Example

# Demonstrate quadratic scaling of attention

def attention_cost(seq_len, d_model=4096, n_heads=32, n_layers=32):
    """Estimate FLOPs for attention in a transformer."""
    d_head = d_model // n_heads

    # QKV projections: O(n * d^2) per layer
    qkv_flops = 3 * seq_len * d_model * d_model

    # Attention scores: O(n^2 * d) per head per layer
    attn_flops = n_heads * seq_len * seq_len * d_head

    # Total per layer, times layers
    total = (qkv_flops + attn_flops) * n_layers
    return total

for length in [2048, 8192, 32768, 131072]:
    flops = attention_cost(length)
    print(f"{length:>7} tokens: {flops/1e12:.1f} TFLOPs")

Follow-up Questions

What is Flash Attention and why does it help?

Flash Attention (Dao et al., 2022) restructures the attention computation to minimize reads/writes to GPU high-bandwidth memory. It computes attention in tiles, keeping intermediate results in fast SRAM. This achieves 2-4x wall-clock speedups and reduces memory usage, without approximating the attention math.
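The tiling idea rests on a running ("online") softmax that lets attention be accumulated one key/value block at a time. The NumPy sketch below illustrates only that arithmetic, not the actual Flash Attention kernel: it has no notion of SRAM or fused GPU operations, and `tiled_attention` is a hypothetical name.

```python
import numpy as np

def tiled_attention(q, k, v, block=2):
    """Exact attention accumulated over K/V blocks with a running softmax."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = q @ k_blk.T / np.sqrt(d)                 # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)       # rescale earlier partial sums
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 4)) for _ in range(3))

# Reference: ordinary attention that materializes the full score matrix
s = q @ k.T / np.sqrt(4)
w = np.exp(s - s.max(axis=-1, keepdims=True))
reference = (w / w.sum(axis=-1, keepdims=True)) @ v

print(np.allclose(tiled_attention(q, k, v), reference))
```

The full n × n score matrix is never stored, which is why the approach saves memory without approximating the result.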

How large is a KV cache for a 70B model at 128K context?

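For a LLaMA-2-70B-style model with 80 layers, 8 KV heads (GQA), and a head dimension of 128, storing 128K tokens in float16 requires roughly 40 GB of GPU memory per sequence just for the KV cache (2 × 80 × 8 × 128 × 131,072 × 2 bytes). Without GQA (64 KV heads) it would be 8x larger, and with several concurrent sequences the cache can exceed the roughly 140 GB of float16 model weights, making KV cache management a primary challenge for long-context serving.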
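The estimate can be reproduced with a short calculation. The configuration below (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumption modeled on LLaMA 2 70B; other 70B-class models will give different numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_value=2):
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * n_tokens * bytes_per_value

# Assumed LLaMA-2-70B-style configuration at 128K tokens, float16
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, n_tokens=128 * 1024)
print(f"{size / 2**30:.1f} GiB per sequence")

# The same model without GQA (64 KV heads) for comparison
size_mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, n_tokens=128 * 1024)
print(f"{size_mha / 2**30:.1f} GiB per sequence without GQA")
```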

Will sub-quadratic attention replacements make transformers obsolete?

Possibly for some use cases. Sub-quadratic architectures such as Mamba (a selective state-space model) and RWKV (a linear-attention-style RNN) achieve O(n) scaling but may sacrifice some modeling quality on complex reasoning tasks. The transformer's quadratic attention remains state-of-the-art for quality; the question is whether efficient alternatives can close the gap.

Causal Masking vs Bidirectional Attention

Causal masking prevents a token from attending to future tokens, which is essential for autoregressive generation. Bidirectional attention lets each token see both left and right context. The mask defines the information flow and determines what the model can know during training and inference.

💡 Causal masking is like writing a story — you cannot read what you have not written yet. Bidirectional attention is like editing a draft — you can see the whole document at once.

How Masking Works

The attention mask is applied before softmax. Masked positions are set to negative infinity, causing softmax to assign them zero weight. This simple mechanism fundamentally controls what information each token can access.
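A minimal sketch of that mechanism, assuming a boolean mask in which True means "may attend": masked scores are replaced with negative infinity before the softmax, so they receive exactly zero weight and each row still sums to 1.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis; positions where mask is False get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 4
scores = rng.normal(size=(n, n))
causal = np.tril(np.ones((n, n), dtype=bool))   # token i may see tokens 0..i

weights = masked_softmax(scores, causal)
print(weights.round(2))
```

This works as long as every row has at least one unmasked position, which a causal mask guarantees through its diagonal.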

Causal (Autoregressive) Masking

In a causal mask, token i can only attend to tokens 0 through i. This is required for next-token prediction because the model must not see future tokens during training — otherwise, it could simply copy the answer. All GPT-style and Claude-style models use causal masking.

Bidirectional Attention

In bidirectional attention, every token attends to every other token (no masking). This is used by encoder models like BERT for understanding tasks: classification, retrieval, and NER. Because the model sees full context, it produces richer representations, but it cannot generate text autoregressively.

Architecture and Objective Are Coupled

The deeper point is that changing the attention mask changes what the model is allowed to know. This is why the architecture variants (Topic 6) are not just structural categories: they represent fundamentally different information flows that suit different tasks.

→ The attention mask defines the information flow. Causal masking enables generation; bidirectional attention enables understanding. Architecture and training objective are tightly coupled.

Python Example

import numpy as np

n = 5  # sequence length

# Causal mask: lower triangle (including diagonal)
causal_mask = np.tril(np.ones((n, n)))
print("Causal mask:")
print(causal_mask)
# [[1, 0, 0, 0, 0],   token 0 sees only itself
#  [1, 1, 0, 0, 0],   token 1 sees tokens 0-1
#  [1, 1, 1, 0, 0],   ...
#  [1, 1, 1, 1, 0],
#  [1, 1, 1, 1, 1]]   token 4 sees all

# Bidirectional: all ones (no masking)
bidir_mask = np.ones((n, n))
print("Bidirectional mask:")
print(bidir_mask)

# Apply mask to attention scores
scores = np.random.randn(n, n)
masked = scores + (1 - causal_mask) * (-1e9)  # -1e9 stands in for -inf at masked positions

Follow-up Questions

Can you mix causal and bidirectional attention in one model?

Yes. Prefix LM models (e.g., UniLM, or the prefix-LM variant studied in the T5 paper) use bidirectional attention over the prefix/prompt tokens and causal attention for generated tokens. Encoder-decoder models naturally combine both: the encoder is bidirectional and the decoder is causal, with cross-attention connecting them.
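
As a rough sketch, the prefix-LM pattern can be expressed as a single mask: start from a causal mask, then open up the columns that belong to the prefix. The function name `prefix_lm_mask` and the sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """1 = query row i may attend to key column j; 0 = blocked."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # causal baseline
    mask[:, :prefix_len] = 1  # every token may see the whole prefix
    return mask

mask = prefix_lm_mask(seq_len=5, prefix_len=2)
print(mask)
# [[1 1 0 0 0]    prefix token 0 sees the full prefix
#  [1 1 0 0 0]    prefix token 1 sees the full prefix
#  [1 1 1 0 0]    generated tokens are causal from here on
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

The prefix rows never gain access to generated tokens, so generation remains autoregressive while the prompt is encoded with full context.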

Does the causal mask hurt generation quality since tokens cannot see future context?

It constrains what the model knows at each step, but the training objective (next-token prediction) is designed precisely for this constraint. The model learns to make the best prediction given only past context. At inference time, this is exactly the setting we need: generate one token at a time, left to right.

Common Transformer Failure Modes

Common failures include attention diffusion on long contexts, positional degradation, context dilution from noisy prompts, hallucination when retrieval is weak, and unstable outputs from poor decoding settings. None of these mean the transformer is broken; they mean the surrounding system must manage the model's limitations.

💡 Understanding failure modes is what separates senior engineers from architecture quiz-takers. The model is a component in a system, and system design must compensate for component limitations.

Attention Diffusion

Attention spreads too thin over long contexts, losing focus on relevant information.

Positional Degradation

Performance drops at sequence lengths beyond training distribution.

Context Dilution

Noisy or irrelevant context in the prompt degrades output quality.

Hallucination

Model generates plausible but factually incorrect content when retrieval is absent or weak.

Decoding Instability

Poor temperature, top-k, or top-p settings cause repetitive, incoherent, or degenerate outputs.

Attention Diffusion

As context length increases, softmax spreads attention weight across more tokens, so the share available to the most relevant ones shrinks and the model has a harder time focusing. A related effect is the "lost in the middle" phenomenon, where information placed in the middle of a long context is used less reliably than information at the beginning or end.
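
A toy calculation makes the dilution concrete. It assumes a single relevant token whose attention score exceeds every other token's by a fixed gap, which is a deliberate simplification; the function name and the gap of 5.0 are illustrative choices, not measurements from a real model.

```python
import numpy as np

def weight_on_relevant_token(context_len, score_gap=5.0):
    """Softmax weight of one token that outscores all others by `score_gap`."""
    scores = np.zeros(context_len)
    scores[0] = score_gap                    # the single relevant token
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return float(weights[0])

for n in (10, 100, 1_000, 10_000):
    print(n, round(weight_on_relevant_token(n), 4))
```

With the score gap held fixed, the relevant token's share of attention falls steadily as the number of competing tokens grows, which is the intuition behind attention diffusion.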

Positional Degradation

Models trained on sequences up to length N often degrade once inputs exceed N, sometimes sharply. Even with Topic 5: RoPE and its extensions, there is no guarantee that extrapolation will preserve quality. Testing at your target context length is essential.

Hallucination and Context Dilution

Hallucination is not random — it occurs when the model is asked to produce information beyond its training data or retrieval context. Context dilution happens when too much irrelevant information in the prompt causes the model to attend to noise rather than signal. Both are managed through system design:

Retrieval quality: Better retrieval reduces hallucination.
Prompt engineering: Shorter, more focused prompts reduce context dilution.
Grounding: Citation requirements and tool use force the model to reference sources.
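
The three levers above can be combined in a small context-building step. The sketch below is hypothetical throughout: the `Passage` type, the score threshold, and the prompt wording are assumptions for illustration, not a specific library's API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    score: float  # hypothetical retriever relevance score in [0, 1]

def select_context(passages, min_score=0.5, max_passages=3):
    """Drop weak passages and cap the count to keep the prompt focused."""
    relevant = [p for p in passages if p.score >= min_score]
    relevant.sort(key=lambda p: p.score, reverse=True)
    return relevant[:max_passages]

def build_prompt(question, passages):
    """Number the sources so the model can be asked to cite them."""
    sources = "\n".join(f"[{i}] {p.text}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered sources and cite them like [1].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

retrieved = [
    Passage("Causal masks block attention to future tokens.", 0.91),
    Passage("Unrelated release notes.", 0.12),
    Passage("BERT uses bidirectional attention.", 0.67),
]
selected = select_context(retrieved)
prompt = build_prompt("What does a causal mask do?", selected)
print(prompt)
```

Filtering addresses context dilution, while the numbered sources and the explicit fallback instruction are one way to encourage grounded answers.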

Decoding Configuration

Setting	Too Low	Too High	Recommended
Temperature	Repetitive, dull	Incoherent, random	0.0-0.3 (factual), 0.7-1.0 (creative)
Top-p	Too deterministic	Low-probability tokens appear	0.9-0.95 for most tasks
Top-k	Limits diversity	Allows noise	40-100, or disable in favor of top-p
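
To show how these settings interact, here is a minimal NumPy sketch of one sampling step. It is an illustrative implementation under simple assumptions (a single logits vector, temperature applied first, then top-k, then top-p); real inference libraries may order or implement these filters differently.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id after temperature, top-k, and top-p filtering."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy decoding

    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # token ids, most probable first
    keep = np.ones(len(order), dtype=bool)
    if top_k is not None:
        keep[top_k:] = False         # keep only the k most probable tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Smallest prefix whose cumulative probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[cutoff:] = False

    kept_ids = order[keep]
    kept_probs = probs[kept_ids] / probs[kept_ids].sum()  # renormalize
    return int(rng.choice(kept_ids, p=kept_probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1, -3.0]
print(sample_token(logits, temperature=0.0))
print([sample_token(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(10)])
```

Lower temperature sharpens the distribution before filtering, so the same top-p value keeps fewer tokens at low temperature than at high temperature.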

→ Senior candidates do not stop at architecture diagrams. They explain how transformer behavior interacts with token budgets, retrieval quality, training data, and serving constraints. That shows system-level understanding.

Python Example

# Toy illustration of a "lost in the middle" attention pattern (simulated, not measured)
import numpy as np

def simulated_attention(seq_len, decay=20.0, baseline=0.3):
    """Simulate attention weights with primacy and recency bias."""
    positions = np.arange(seq_len)

    # Extra weight near the start (primacy) and near the end (recency)
    primacy = np.exp(-positions / decay)
    recency = np.exp(-(seq_len - 1 - positions) / decay)

    # Every position gets the same baseline; the middle gets little beyond it
    weights = primacy + recency + baseline
    return weights / weights.sum()  # normalize so weights sum to 1

weights = simulated_attention(100)
# Compare equal-sized 5-token windows so the sums are comparable
print("Start weight: ", weights[:5].sum().round(3))
print("Middle weight:", weights[48:53].sum().round(3))
print("End weight:   ", weights[-5:].sum().round(3))

Follow-up Questions

How do you mitigate the "lost in the middle" problem?

Place the most important information at the beginning or end of the context. For RAG, re-order retrieved passages so the most relevant ones sit at the start (or the start and the end) rather than in the middle. Some systems also duplicate key information. Newer long-context models are trained to use information more evenly across the context, which reduces but does not eliminate this bias.
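
One possible re-ordering scheme, sketched under the assumption that passages arrive sorted from most to least relevant: alternate them between the front and the back so the weakest items land in the middle. The function name is hypothetical, and this is only one of several reasonable layouts.

```python
def reorder_for_long_context(passages_by_relevance):
    """Input: most relevant first. Output: strongest at the edges, weakest in the middle."""
    front, back = [], []
    for rank, passage in enumerate(passages_by_relevance):
        # Even ranks fill the front, odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]

ranked = ["p1", "p2", "p3", "p4", "p5"]  # p1 = most relevant
print(reorder_for_long_context(ranked))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```

Here the top two passages occupy the first and last slots, and the least relevant passage sits in the middle, where it is least likely to matter.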

Is hallucination a transformer-specific problem?

Hallucination occurs in any generative model that learns statistical patterns rather than grounded facts. It is not transformer-specific, but the transformer's fluency makes hallucinations more convincing and therefore more dangerous. The mitigation is always system-level: retrieval, grounding, verification, and citation.

How do you debug transformer outputs in production?

Key diagnostic tools: log probability inspection (how confident was the model?), attention visualization (what did it focus on?), prompt ablation (which context parts affected the output?), and token-level analysis (was the issue in tokenization or generation?). Start broad, then narrow down.
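
Log probability inspection is the easiest of these to sketch. The example below assumes you have per-step logits and the ids of the tokens that were actually generated; the function names, the toy logits, and the 0.5 probability threshold are all illustrative assumptions.

```python
import numpy as np

def token_logprobs(logits, token_ids):
    """Log-probability assigned to each generated token.

    logits: array of shape (steps, vocab); token_ids: array of shape (steps,).
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return log_probs[np.arange(len(token_ids)), token_ids]

def flag_low_confidence(logprobs, threshold=np.log(0.5)):
    """Indices of steps where the chosen token had probability below 0.5."""
    return [i for i, lp in enumerate(logprobs) if lp < threshold]

logits = np.array([
    [5.0, 0.0, 0.0],   # confident step
    [1.0, 1.0, 1.0],   # uniform: the model was guessing
    [0.0, 4.0, 0.0],   # confident step
])
token_ids = np.array([0, 2, 1])

lps = token_logprobs(logits, token_ids)
print(np.exp(lps).round(3))      # per-token probabilities
print(flag_low_confidence(lps))  # [1]
```

Flagged steps are a starting point for the narrower checks, such as prompt ablation around the span where confidence dropped.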