The mathematical functions that turn raw vectors into the weighted attention mechanism at the heart of every transformer.
Softmax in Attention
The Formula
For a vector of raw scores z, softmax computes: softmax(z_i) = exp(z_i) / sum(exp(z_j)) for all j. The exponential ensures all outputs are positive. The division by the sum ensures they add to 1. Temperature T scales the logits before the exponential: softmax(z_i / T). Lower T sharpens the distribution (more confident); higher T flattens it (more uniform).
Why Softmax in Attention Specifically
In self-attention, the model computes Q * K^T to get similarity scores between every pair of tokens. Without softmax, these scores are unbounded real numbers. Softmax transforms them into a proper distribution, enabling the model to compute a weighted average of value vectors. This weighted average is what makes attention a "soft lookup" rather than a hard selection.
Numerical Stability
In practice, implementations subtract the maximum logit before exponentiating: exp(z_i - max(z)). This prevents overflow from large exponentials while producing identical results (the subtraction cancels in the numerator and denominator). This is why you see z - z.max() in every production softmax implementation.
See Topic 2: Dot Product in Attention for why the raw scores need scaling before softmax.
Python — Softmax from Scratch
import numpy as np
def softmax(logits, temperature=1.0):
"""
Numerically stable softmax with temperature scaling.
Lower temperature = sharper (more confident) distribution.
Higher temperature = flatter (more uniform) distribution.
"""
# Scale by temperature
scaled = np.array(logits) / temperature
# Subtract max for numerical stability (prevents overflow)
shifted = scaled - np.max(scaled)
# Exponentiate and normalize
exp_vals = np.exp(shifted)
return exp_vals / exp_vals.sum()
# Example: attention scores for 5 tokens
scores = [2.0, 5.0, 1.0, 0.5, 3.0]
print("T=1.0 (normal):", softmax(scores, 1.0))
print("T=0.5 (sharp): ", softmax(scores, 0.5))
print("T=2.0 (flat): ", softmax(scores, 2.0))
Why not use a simpler normalization like dividing by the sum?
What is the relationship between softmax temperature and model creativity?
What is Flash Attention and how does it relate to softmax?
The Dot Product in Self-Attention
Why Scaling Matters
Raw dot products grow proportionally to the vector dimension d_k. In a typical transformer with d_k = 128, random query-key pairs produce dot products whose variance scales as d_k. Without scaling, these large values push softmax into near-one-hot territory, causing vanishing gradients through the softmax Jacobian. Dividing by sqrt(d_k) keeps the input to softmax in a numerically stable range (Vaswani et al., 2017).
The Full Attention Equation
Bringing it together: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V. The dot product measures compatibility. The scaling prevents gradient problems. Softmax normalizes to a distribution. The result multiplies the value vectors, producing a context-dependent representation for each token.
Multi-Head Attention
Instead of a single large dot-product attention, transformers split Q, K, V into multiple heads, each attending with a smaller d_k. Each head can learn different notions of relevance (syntactic, semantic, positional). The outputs are concatenated and projected back. This improves both expressiveness and training stability.
Python — Scaled Dot-Product Attention
import numpy as np
def scaled_dot_product_attention(Q, K, V):
"""
Core attention mechanism used in every transformer.
Q, K, V: arrays of shape (seq_len, d_k)
Returns: weighted sum of V based on Q-K compatibility.
"""
d_k = Q.shape[-1]
# Step 1: Compute raw compatibility scores
scores = Q @ K.T # shape: (seq_len, seq_len)
# Step 2: Scale to prevent softmax saturation
scores = scores / np.sqrt(d_k) # critical for training stability
# Step 3: Normalize to get attention weights
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
# Step 4: Weighted combination of values
return weights @ V # shape: (seq_len, d_k)
# Example with 4 tokens, dimension 8
np.random.seed(42)
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
output = scaled_dot_product_attention(Q, K, V)
print("Output shape:", output.shape)
Why not use cosine similarity instead of dot product?
How does multi-head attention improve over single-head?
How the training loop measures prediction quality and distributional distance — the signals that drive all parameter updates.
Cross-Entropy Loss
The Formula
For next-token prediction with one correct token: loss = -log(p_correct). When the model assigns p=0.9 to the correct token, loss is 0.105. At p=0.1, loss is 2.302. At p=0.01, loss is 4.605. The logarithmic scale means the penalty grows dramatically for low-confidence correct predictions, providing strong gradient signal where the model needs to learn most.
Why Cross-Entropy Wins
| Property | Why It Matters for LLMs |
|---|---|
| Pairs naturally with softmax | The softmax-cross-entropy combination has clean, efficient gradients |
| Probabilistic interpretation | Directly measures prediction quality as information-theoretic surprise |
| Strong gradient signal | Bad predictions produce large gradients — the model learns fastest where it is most wrong |
| Connection to perplexity | Perplexity = exp(avg cross-entropy), the standard LLM evaluation metric |
Perplexity Connection
Perplexity is the exponential of average cross-entropy loss: PPL = exp(avg_CE). A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options. Lower perplexity means better prediction. This is the most common metric for comparing language model quality.
See Topic 4: KL Divergence for how cross-entropy relates to distributional distance measures used in distillation and alignment.
Python — Cross-Entropy and Perplexity
import numpy as np
def cross_entropy_loss(predicted_probs, true_index):
"""
Compute cross-entropy loss for a single prediction.
predicted_probs: softmax output (sums to 1)
true_index: index of the correct token
"""
eps = 1e-12 # prevent log(0)
return -np.log(predicted_probs[true_index] + eps)
# Example: model predicts distribution over 3 tokens
pred = np.array([0.1, 0.8, 0.1]) # model's softmax output
true_token = 1 # correct token is index 1
loss = cross_entropy_loss(pred, true_token)
perplexity = np.exp(loss)
print(f"Loss: {loss:.4f}") # 0.2231
print(f"Perplexity: {perplexity:.4f}") # 1.2500
# Compare: what if model was wrong?
pred_bad = np.array([0.7, 0.1, 0.2])
loss_bad = cross_entropy_loss(pred_bad, true_token)
print(f"Bad loss: {loss_bad:.4f}") # 2.3026 -- much higher!
Why not use mean squared error instead of cross-entropy?
What is "good" perplexity for a language model?
How does label smoothing affect cross-entropy?
KL Divergence
The Formula
KL(P || Q) = sum(P(x) * log(P(x) / Q(x))) for all x. This measures the expected extra "cost" (in bits or nats) of encoding data from distribution P using a code optimized for distribution Q. When P and Q are identical, the cost is zero.
Why Asymmetry Matters
KL(P || Q) penalizes Q heavily wherever P has probability mass but Q does not. In RLHF, this means KL(policy || reference) prevents the fine-tuned policy from straying too far from the base model — it acts as a leash. The direction you choose determines which errors get penalized most.
KL in Practice
| Application | P (reference) | Q (model) | Purpose |
|---|---|---|---|
| Distillation | Teacher output | Student output | Train student to match teacher |
| RLHF penalty | Base model output | Policy model output | Prevent reward hacking |
| VAE training | Standard normal | Encoder output | Regularize latent space |
See Topic 3: Cross-Entropy Loss for the relationship: cross-entropy = entropy + KL divergence.
Python — Cross-Entropy and KL Divergence
import numpy as np
# Two probability distributions over 3 tokens
target = np.array([0.0, 1.0, 0.0]) # true: token 1 is correct
pred = np.array([0.1, 0.8, 0.1]) # model's prediction
teacher = np.array([0.05, 0.9, 0.05]) # teacher model's softer output
eps = 1e-12 # numerical stability
# Cross-entropy: how well does pred match target?
cross_entropy = -(target * np.log(pred + eps)).sum()
# KL divergence: how far is pred from teacher?
# KL(teacher || pred) = sum(teacher * log(teacher / pred))
kl = (teacher * (np.log(teacher + eps) - np.log(pred + eps))).sum()
print(f"cross_entropy = {cross_entropy:.4f}") # 0.2231
print(f"kl_divergence = {kl:.4f}") # 0.0144
print(f"Note: KL is small because teacher and pred are similar")
What is the KL penalty in RLHF and why is it needed?
When would you use Jensen-Shannon divergence instead of KL?
How gradients flow through transformers — from embedding lookups through vector-valued layers to the chain rule that ties it all together.
Gradients for Embeddings
How It Works
An embedding layer is mathematically equivalent to multiplying a one-hot vector by the embedding matrix. The one-hot vector selects exactly one row. During backpropagation, the gradient with respect to the embedding matrix is also sparse: it has non-zero entries only for the rows corresponding to tokens that appeared in the batch.
Why Sparsity Matters
With a vocabulary of 100,000 tokens and a batch of 2,048 tokens, fewer than 2% of embedding rows receive gradient updates on any given step. This has practical consequences:
- Rare tokens get updated infrequently, so their embeddings are less refined
- Optimizers must handle sparse gradients correctly (Adam does this naturally via per-parameter moments)
- Gradient accumulation across batches helps rare tokens get enough updates
Frozen vs Trainable Embeddings
In PEFT methods, embeddings are typically frozen (no gradient updates). This preserves the base model's token representations. In full fine-tuning, embedding updates can help with domain-specific terminology but also risk destabilizing well-learned representations for common tokens.
Python — Embedding Gradient Sparsity
import torch
import torch.nn as nn
# Create a small embedding table: 10 tokens, 4-dimensional embeddings
embed = nn.Embedding(10, 4)
# Forward pass: look up tokens [2, 5, 2]
token_ids = torch.tensor([2, 5, 2])
output = embed(token_ids)
# Simulate a loss and backward pass
loss = output.sum()
loss.backward()
# Only rows 2 and 5 received gradients
print("Gradient for each row:")
for i in range(10):
grad = embed.weight.grad[i]
has_grad = grad.abs().sum().item() > 0
print(f" Token {i}: {'UPDATED' if has_grad else 'skipped':} grad={grad.tolist()}")
# Output: Only tokens 2 and 5 show non-zero gradients
How do embeddings differ from a regular linear layer?
Embedding(token_id) is the same as one_hot(token_id) @ W. The difference is implementation efficiency. Embedding layers use an index lookup (O(1) per token) instead of a matrix multiply (O(vocab_size * dim)), making them vastly faster for large vocabularies.Can you freeze just the embeddings during fine-tuning?
The Jacobian Matrix
From Scalars to Vectors
For a scalar function f(x), the derivative df/dx tells you the rate of change. For a vector-valued function f: R^n -> R^m, the Jacobian matrix J has shape m x n, where each entry J[i,j] = df_i/dx_j. This generalizes the single-variable derivative to the multivariate case.
Why It Matters in Deep Learning
Each layer in a neural network is a vector-valued function. Backpropagation needs to pass gradients backward through these functions. The chain rule for vector functions involves multiplying Jacobian matrices: dL/dx = J^T * dL/df. This is how gradient information flows from the loss back through every layer.
Practical Implications
- Jacobian size — For a layer mapping R^4096 to R^4096, the full Jacobian has 16.7M entries. Storing it is impractical, which is why frameworks compute Jacobian-vector products instead.
- Jacobian conditioning — If the Jacobian has very large or very small singular values, gradients explode or vanish. This is the mathematical root of training instability in deep networks.
- Softmax Jacobian — The Jacobian of softmax has a specific structure that makes backpropagation through attention efficient.
Python — Computing the Jacobian
import torch
def compute_jacobian(f, x):
"""
Compute the full Jacobian matrix of f at point x.
f: function R^n -> R^m
x: input tensor of shape (n,)
Returns: Jacobian matrix of shape (m, n)
"""
x = x.requires_grad_(True)
y = f(x)
m = y.shape[0]
n = x.shape[0]
jacobian = torch.zeros(m, n)
for i in range(m):
# Gradient of the i-th output w.r.t. all inputs
grad = torch.autograd.grad(
y[i], x, retain_graph=True, create_graph=True
)[0]
jacobian[i] = grad
return jacobian
# Example: softmax Jacobian (3-dimensional)
x = torch.tensor([2.0, 1.0, 0.5])
J = compute_jacobian(torch.nn.functional.softmax, x)
print("Softmax Jacobian:\n", J)
Why not compute the full Jacobian during training?
m x n entries per layer. For transformer layers with 4096-dimensional inputs and outputs, that is 16.7 million entries per layer. Instead, backpropagation computes Jacobian-vector products (JVPs) directly, which gives you the gradient of the scalar loss w.r.t. inputs without ever materializing the full matrix. This is the key insight that makes training large models practical.How does the Jacobian relate to gradient vanishing/exploding?
The Chain Rule and Backpropagation
Each factor is a local derivative (Jacobian) computed during the backward pass
The Core Idea
For a composition L = loss(f2(f1(x))), the chain rule gives: dL/dx = dL/df2 * df2/df1 * df1/dx. Each factor is a local derivative that depends only on the function at that layer and its current input. Backpropagation computes these factors right-to-left, reusing cached forward-pass values.
Why Backpropagation Is Efficient
A naive approach to computing gradients would re-evaluate the entire network for each parameter. Backpropagation instead makes one forward pass (caching intermediate values) and one backward pass (multiplying local derivatives). The computational cost of the backward pass is roughly 2x the forward pass, regardless of network depth.
Practical Consequences
| Concept | What Happens | Engineering Impact |
|---|---|---|
| Activation checkpointing | Drop cached activations, recompute during backward | Trades compute for memory (critical for large models) |
| Gradient accumulation | Sum gradients over multiple mini-batches before updating | Simulates larger batch sizes with less GPU memory |
| Mixed precision | Forward in fp16/bf16, accumulate gradients in fp32 | 2x memory reduction, faster compute |
The chain rule is also why residual connections are critical — they create additive shortcut paths through the chain, preventing gradient signal from vanishing over many multiplications.
Python — Manual Backpropagation
import numpy as np
# Manual chain rule through a 2-layer network
# y = relu(x @ W1) @ W2
# loss = 0.5 * (y - target)^2
np.random.seed(42)
x = np.array([1.0, 2.0]) # input
W1 = np.random.randn(2, 3) # layer 1 weights
W2 = np.random.randn(3, 1) # layer 2 weights
target = np.array([1.0]) # target output
# --- Forward pass (cache intermediates) ---
z1 = x @ W1 # pre-activation
h1 = np.maximum(0, z1) # ReLU activation
y = h1 @ W2 # output
loss = 0.5 * (y - target) ** 2 # MSE loss
# --- Backward pass (chain rule, right to left) ---
dL_dy = y - target # dL/dy
dL_dW2 = h1.reshape(-1, 1) @ dL_dy.reshape(1, -1) # dL/dW2
dL_dh1 = dL_dy @ W2.T # chain: dL/dh1
dL_dz1 = dL_dh1 * (z1 > 0) # ReLU derivative
dL_dW1 = x.reshape(-1, 1) @ dL_dz1.reshape(1, -1) # dL/dW1
print(f"Loss: {loss.item():.4f}")
print(f"dL/dW1 shape: {dL_dW1.shape}")
print(f"dL/dW2 shape: {dL_dW2.shape}")
What is activation checkpointing and why does it help?
Why is the backward pass roughly 2x the cost of the forward pass?
How does gradient accumulation work?
The architectural and mathematical mechanisms that keep deep transformer training stable — activation functions, residual paths, normalization, and dimensionality reduction.
The ReLU Derivative and Activation Functions
Why ReLU Changed Deep Learning
Before ReLU, networks used sigmoid and tanh activations. Both saturate at extreme values, producing derivatives near zero. When you multiply many near-zero derivatives via the chain rule, gradients vanish exponentially. ReLU's derivative of exactly 1 for positive inputs avoids this: gradients pass through unchanged.
Activation Function Comparison
| Function | Formula | Derivative | Used In |
|---|---|---|---|
| ReLU | max(0, x) | 0 or 1 | Earlier transformers, CNNs |
| GELU | x * Phi(x) | Smooth curve | GPT, BERT, most modern LLMs |
| SiLU/Swish | x * sigmoid(x) | Smooth, non-monotonic | LLaMA, PaLM |
| Sigmoid | 1/(1+exp(-x)) | sigma*(1-sigma) | Legacy, gating mechanisms |
Modern Transformers: GELU and SiLU
Most current LLMs use GELU (Gaussian Error Linear Unit) or SiLU/Swish instead of plain ReLU. These smooth approximations of ReLU have non-zero gradients for slightly negative inputs, which helps with optimization. The key insight remains the same: avoid saturating derivatives that kill gradient flow.
The "dead neuron" problem of ReLU (neurons stuck at zero output) is largely avoided by GELU and SiLU, which is one reason modern architectures prefer them.
Python — Activation Functions and Derivatives
import numpy as np
def relu(x):
"""ReLU: simple, fast, but can produce dead neurons."""
return np.maximum(0, x)
def relu_derivative(x):
"""Binary derivative: 1 if x > 0, else 0."""
return (x > 0).astype(float)
def gelu(x):
"""GELU: used in GPT, BERT. Smooth approximation of ReLU."""
return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
# Compare gradient flow through 50 layers
x = np.array([0.5])
relu_grad = 1.0
sigmoid_grad = 1.0
for _ in range(50):
relu_grad *= relu_derivative(x)[0] # stays 1.0
sig = 1 / (1 + np.exp(-x[0]))
sigmoid_grad *= sig * (1 - sig) # shrinks fast
print(f"After 50 layers:")
print(f" ReLU gradient: {relu_grad:.6f}") # 1.000000
print(f" Sigmoid gradient: {sigmoid_grad:.2e}") # ~1e-38 (vanished!)
What is the "dead neuron" problem in ReLU?
Why did transformers switch from ReLU to GELU?
Residual Connections and Normalization
Residual Connections: The Gradient Highway
A residual connection computes output = x + F(x) instead of output = F(x). During backpropagation, the gradient splits: dL/dx = dL/doutput * (1 + dF/dx). The +1 term is the residual path — it passes gradient through unchanged regardless of what F does. Even if dF/dx approaches zero (vanishing gradient through the layer), the gradient still flows via the identity path.
Layer Normalization
LayerNorm normalizes activations across the feature dimension to have zero mean and unit variance. This prevents activations from drifting to extreme values across layers, keeping the inputs to each layer in a well-conditioned range. Without normalization, the accumulation of residual additions can cause activations to grow unboundedly.
Pre-Norm vs Post-Norm
| Variant | Order | Behavior |
|---|---|---|
| Post-Norm (original) | Attention → Add → Norm | Harder to train deep models without warmup |
| Pre-Norm (modern) | Norm → Attention → Add | More stable training, preferred in most LLMs |
Why Both Are Needed Together
Residual connections alone let gradients flow but can cause activation explosion. Normalization alone stabilizes activations but does not fix gradient vanishing through complex layers. Together, they create a training regime where 96-layer transformers train as reliably as 6-layer ones — a feat that seemed impossible before these techniques.
See Topic 7: Chain Rule and Backpropagation for why gradient flow matters, and Topic 6: The Jacobian for the mathematical framework.
Python — Transformer Block with Residual + Norm
import torch
import torch.nn as nn
class TransformerBlock(nn.Module):
"""
A single transformer block with pre-norm residual connections.
This is the standard architecture in modern LLMs (GPT, LLaMA).
"""
def __init__(self, d_model, n_heads):
super().__init__()
# Pre-norm: normalize BEFORE the sublayer
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
self.ff = nn.Sequential(
nn.Linear(d_model, 4 * d_model),
nn.GELU(), # modern activation
nn.Linear(4 * d_model, d_model),
)
def forward(self, x):
# Residual connection 1: x + Attention(Norm(x))
normed = self.norm1(x)
attn_out, _ = self.attn(normed, normed, normed)
x = x + attn_out # gradient highway
# Residual connection 2: x + FF(Norm(x))
x = x + self.ff(self.norm2(x)) # gradient highway
return x
What is RMSNorm and why do some models prefer it?
Can you have too many residual connections?
Eigenvalues, Eigenvectors, and Dimensionality Reduction
What Eigenvectors and Eigenvalues Are
For a square matrix A, an eigenvector v satisfies A * v = lambda * v. The matrix only scales the eigenvector, without changing its direction. The scaling factor lambda is the eigenvalue. In PCA, the covariance matrix's eigenvectors point along the principal axes of variation in the data, and eigenvalues tell you how spread out the data is along each axis.
Connection to LLM Practice
| Application | How Eigen-Analysis Appears |
|---|---|
| LoRA | Low-rank adapters work because weight updates are approximately low-rank — most change concentrates in a few directions (large eigenvalues) |
| Embedding visualization | PCA projects high-dimensional token embeddings into 2D/3D for visualization |
| Training stability | The eigenvalues of the Hessian (curvature matrix) determine whether optimization is stable |
| Compression | SVD (which uses eigen-decomposition) is the mathematical backbone of model compression |
Why Low Rank Works for Adaptation
When you fine-tune a model, the weight updates delta_W tend to be approximately low-rank: most of the change concentrates in a small subspace. This is why LoRA (see the chapter on fine-tuning) can capture most of the adaptation quality with rank 16 or 32 out of 4096 dimensions. The eigenvalue spectrum of typical weight updates drops off quickly — the top few directions carry most of the information.
PCA in One Paragraph
Principal Component Analysis computes the eigenvectors of the data's covariance matrix, sorts them by eigenvalue, and keeps the top k. Projecting data onto these k vectors gives a k-dimensional representation that preserves as much variance as possible. In LLM work, you rarely apply PCA inside the model, but the concept explains why lower-dimensional projections and latent spaces can retain useful structure.
Python — PCA and Eigenvalues
import numpy as np
# Generate some 2D data with clear principal directions
np.random.seed(42)
n = 200
data = np.random.randn(n, 2) @ np.array([[3, 1], [1, 0.5]])
# Step 1: Center the data
centered = data - data.mean(axis=0)
# Step 2: Compute covariance matrix
cov = (centered.T @ centered) / (n - 1)
# Step 3: Eigen-decomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Variance explained
total_var = eigenvalues.sum()
for i, ev in enumerate(eigenvalues):
print(f"PC{i+1}: eigenvalue={ev:.3f}, variance explained={ev/total_var:.1%}")
# In LLM context: if eigenvalues drop off quickly,
# low-rank approximations (like LoRA) capture most of the signal