Ch 14: Optimization & Math Foundations for Language Models

Attention Math

The mathematical functions that turn raw vectors into the weighted attention mechanism at the heart of every transformer.

Softmax in Attention

Softmax converts raw similarity scores into a normalized probability distribution over tokens. It makes attention differentiable and ensures weights sum to one, allowing the model to blend value vectors proportionally instead of making a hard one-token choice.

💡 Softmax is the voting system of attention — it turns raw "how relevant is this?" scores into percentage-based weights that must add up to 100%.

Enter raw logits (attention scores) and adjust temperature:

Temperature: 1.0

The Formula

For a vector of raw scores z, softmax computes: softmax(z_i) = exp(z_i) / sum(exp(z_j)) for all j. The exponential ensures all outputs are positive. The division by the sum ensures they add to 1. Temperature T scales the logits before the exponential: softmax(z_i / T). Lower T sharpens the distribution (more confident); higher T flattens it (more uniform).

Why Softmax in Attention Specifically

In self-attention, the model computes Q * K^T to get similarity scores between every pair of tokens. Without softmax, these scores are unbounded real numbers. Softmax transforms them into a proper distribution, enabling the model to compute a weighted average of value vectors. This weighted average is what makes attention a "soft lookup" rather than a hard selection.

Numerical Stability

In practice, implementations subtract the maximum logit before exponentiating: exp(z_i - max(z)). This prevents overflow from large exponentials while producing identical results (the subtraction cancels in the numerator and denominator). This is why you see z - z.max() in every production softmax implementation.

See Topic 2: Dot Product in Attention for why the raw scores need scaling before softmax.

→ Softmax normalizes attention scores into a probability distribution — making attention differentiable, interpretable, and stable enough to train at scale.

Python — Softmax from Scratch

import numpy as np

def softmax(logits, temperature=1.0):
    """
    Numerically stable softmax with temperature scaling.
    Lower temperature = sharper (more confident) distribution.
    Higher temperature = flatter (more uniform) distribution.
    """
    # Scale by temperature
    scaled = np.array(logits) / temperature

    # Subtract max for numerical stability (prevents overflow)
    shifted = scaled - np.max(scaled)

    # Exponentiate and normalize
    exp_vals = np.exp(shifted)
    return exp_vals / exp_vals.sum()

# Example: attention scores for 5 tokens
scores = [2.0, 5.0, 1.0, 0.5, 3.0]

print("T=1.0 (normal):", softmax(scores, 1.0))
print("T=0.5 (sharp): ", softmax(scores, 0.5))
print("T=2.0 (flat):  ", softmax(scores, 2.0))

Follow-up Questions

Why not use a simpler normalization like dividing by the sum?

Simple normalization (z_i / sum(z)) fails because raw scores can be negative, which would produce negative "probabilities." The exponential in softmax maps all values to positive numbers first. It also has a useful gradient structure: the Jacobian of softmax is well-defined and easy to backpropagate through, which is critical for training.

What is the relationship between softmax temperature and model creativity?

At inference time, temperature controls how "peaked" the next-token distribution is. T < 1 concentrates probability on the top tokens (more deterministic, less creative). T > 1 spreads probability more evenly (more random, more creative). T = 0 converges to argmax (greedy decoding). This is the same mathematical mechanism as in attention, just applied at the output layer.

What is Flash Attention and how does it relate to softmax?

Flash Attention computes the exact same softmax-based attention but reorganizes the computation to be more memory-efficient. It processes attention in tiles, computing partial softmax results and combining them using the log-sum-exp trick. The mathematical output is identical, but memory usage drops from O(n^2) to O(n), enabling much longer context windows.

The Dot Product in Self-Attention

The dot product provides a fast measure of alignment between two vectors. In attention, the query-key dot product says "how relevant does this other token look to the current token?" Larger values imply stronger alignment and therefore larger attention weights after softmax.

💡 The dot product is a compatibility test. Two vectors pointing in the same direction score high; orthogonal vectors score zero. It is the simplest way to ask "do these two things match?"

Query (q)

0.8

0.3

0.6

0.1

Key (k)

0.7

0.5

0.4

0.2

0.97

raw score

Scaled score = raw / sqrt(d_k) = 0.97 / sqrt(4) = 0.485

Scaling prevents large dot products from pushing softmax into saturation (see Topic 1)

Why Scaling Matters

Raw dot products grow proportionally to the vector dimension d_k. In a typical transformer with d_k = 128, random query-key pairs produce dot products whose variance scales as d_k. Without scaling, these large values push softmax into near-one-hot territory, causing vanishing gradients through the softmax Jacobian. Dividing by sqrt(d_k) keeps the input to softmax in a numerically stable range (Vaswani et al., 2017).

The Full Attention Equation

Bringing it together: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V. The dot product measures compatibility. The scaling prevents gradient problems. Softmax normalizes to a distribution. The result multiplies the value vectors, producing a context-dependent representation for each token.

Multi-Head Attention

Instead of a single large dot-product attention, transformers split Q, K, V into multiple heads, each attending with a smaller d_k. Each head can learn different notions of relevance (syntactic, semantic, positional). The outputs are concatenated and projected back. This improves both expressiveness and training stability.

→ The dot product is the cheapest meaningful similarity measure in high-dimensional space. Scaling by sqrt(d_k) is essential — without it, large dimensions push softmax into gradient-killing saturation.

Python — Scaled Dot-Product Attention

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Core attention mechanism used in every transformer.
    Q, K, V: arrays of shape (seq_len, d_k)
    Returns: weighted sum of V based on Q-K compatibility.
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw compatibility scores
    scores = Q @ K.T                  # shape: (seq_len, seq_len)

    # Step 2: Scale to prevent softmax saturation
    scores = scores / np.sqrt(d_k)    # critical for training stability

    # Step 3: Normalize to get attention weights
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted combination of values
    return weights @ V                 # shape: (seq_len, d_k)

# Example with 4 tokens, dimension 8
np.random.seed(42)
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
output = scaled_dot_product_attention(Q, K, V)
print("Output shape:", output.shape)

Follow-up Questions

Why not use cosine similarity instead of dot product?

Cosine similarity normalizes by vector magnitude, which removes the model's ability to use vector length as a signal. The dot product preserves magnitude information, allowing the model to learn that some tokens should have stronger or weaker attention regardless of direction. The scaling by sqrt(d_k) achieves stability without discarding magnitude.

How does multi-head attention improve over single-head?

Each head operates in a smaller subspace and can learn a different type of relationship. Empirically, different heads specialize: some track syntax (subject-verb agreement), others track semantics (coreference), others track position. A single large head conflates all these signals. The computational cost is the same since the total dimension stays constant.

Loss & Divergence

How the training loop measures prediction quality and distributional distance — the signals that drive all parameter updates.

Cross-Entropy Loss

Cross-entropy measures how well the model's predicted probability distribution matches the true target. In next-token prediction, it penalizes the model when it assigns too little probability to the correct token — the stronger the penalty, the more the model needs to learn.

💡 Cross-entropy is a surprise meter. If the model is confident about the right answer, surprise (loss) is low. If it assigns the right answer low probability, surprise is high. Training minimizes surprise.

Adjust the model's predicted probability for the correct token:

P(correct): 0.70

0.36

Cross-Entropy Loss = -log(P(correct))

The Formula

For next-token prediction with one correct token: loss = -log(p_correct). When the model assigns p=0.9 to the correct token, loss is 0.105. At p=0.1, loss is 2.302. At p=0.01, loss is 4.605. The logarithmic scale means the penalty grows dramatically for low-confidence correct predictions, providing strong gradient signal where the model needs to learn most.

Why Cross-Entropy Wins

Property	Why It Matters for LLMs
Pairs naturally with softmax	The softmax-cross-entropy combination has clean, efficient gradients
Probabilistic interpretation	Directly measures prediction quality as information-theoretic surprise
Strong gradient signal	Bad predictions produce large gradients — the model learns fastest where it is most wrong
Connection to perplexity	Perplexity = exp(avg cross-entropy), the standard LLM evaluation metric

Perplexity Connection

Perplexity is the exponential of average cross-entropy loss: PPL = exp(avg_CE). A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options. Lower perplexity means better prediction. This is the most common metric for comparing language model quality.

See Topic 4: KL Divergence for how cross-entropy relates to distributional distance measures used in distillation and alignment.

→ Cross-entropy is the universal LLM loss because it gives the right gradient signal: penalize proportionally to how surprised the model is by the correct answer.

Python — Cross-Entropy and Perplexity

import numpy as np

def cross_entropy_loss(predicted_probs, true_index):
    """
    Compute cross-entropy loss for a single prediction.
    predicted_probs: softmax output (sums to 1)
    true_index: index of the correct token
    """
    eps = 1e-12  # prevent log(0)
    return -np.log(predicted_probs[true_index] + eps)

# Example: model predicts distribution over 3 tokens
pred = np.array([0.1, 0.8, 0.1])  # model's softmax output
true_token = 1                      # correct token is index 1

loss = cross_entropy_loss(pred, true_token)
perplexity = np.exp(loss)

print(f"Loss:       {loss:.4f}")       # 0.2231
print(f"Perplexity: {perplexity:.4f}") # 1.2500

# Compare: what if model was wrong?
pred_bad = np.array([0.7, 0.1, 0.2])
loss_bad = cross_entropy_loss(pred_bad, true_token)
print(f"Bad loss:   {loss_bad:.4f}")   # 2.3026 -- much higher!

Follow-up Questions

Why not use mean squared error instead of cross-entropy?

MSE measures distance between predicted and true probability values. Cross-entropy measures information-theoretic surprise. For probability distributions, cross-entropy produces stronger gradients when the model is confidently wrong (e.g., predicting 0.01 for the correct token). MSE's gradients would be small in the same situation, leading to slower learning.

What is "good" perplexity for a language model?

It depends on the data. On standard English text (Penn Treebank), state-of-the-art models achieve perplexity around 15–25. On broader web text (The Pile), perplexity ranges from 5–15 for large models. Lower is better, but comparing perplexity across different datasets is meaningless — it only makes sense as a relative metric within the same evaluation set.

How does label smoothing affect cross-entropy?

Label smoothing replaces the hard target (100% on correct token) with a soft target (e.g., 90% on correct, 10% spread across others). This prevents the model from becoming overconfident and improves generalization. Mathematically, it mixes cross-entropy with a uniform-distribution penalty, acting as a regularizer.

KL Divergence

KL divergence measures how one probability distribution differs from another. In LLM work it appears when distilling a student from a teacher, constraining policy updates in RLHF, or comparing a model's output to a reference distribution. It is not symmetric — the direction matters.

💡 KL divergence is a one-way penalty: "how much extra surprise do you get from using distribution Q when the truth is P?" It is a leash, not a distance — the pull is stronger in one direction.

Used In

• Knowledge distillation (teacher vs student)

• RLHF KL penalty (policy vs reference)

• VAE latent regularization

• Distribution matching in general

Key Properties

• Not symmetric: KL(P||Q) ≠ KL(Q||P)

• Always >= 0 (Gibbs inequality)

• = 0 only when P = Q exactly

• Related to cross-entropy: CE(P,Q) = H(P) + KL(P||Q)

The Formula

KL(P || Q) = sum(P(x) * log(P(x) / Q(x))) for all x. This measures the expected extra "cost" (in bits or nats) of encoding data from distribution P using a code optimized for distribution Q. When P and Q are identical, the cost is zero.

Why Asymmetry Matters

KL(P || Q) penalizes Q heavily wherever P has probability mass but Q does not. In RLHF, this means KL(policy || reference) prevents the fine-tuned policy from straying too far from the base model — it acts as a leash. The direction you choose determines which errors get penalized most.

KL in Practice

Application	P (reference)	Q (model)	Purpose
Distillation	Teacher output	Student output	Train student to match teacher
RLHF penalty	Base model output	Policy model output	Prevent reward hacking
VAE training	Standard normal	Encoder output	Regularize latent space

See Topic 3: Cross-Entropy Loss for the relationship: cross-entropy = entropy + KL divergence.

→ KL divergence is not a distance but a directional penalty. It keeps distributions close in a controlled way — essential for distillation, alignment, and policy optimization.

Python — Cross-Entropy and KL Divergence

import numpy as np

# Two probability distributions over 3 tokens
target  = np.array([0.0, 1.0, 0.0])   # true: token 1 is correct
pred    = np.array([0.1, 0.8, 0.1])   # model's prediction
teacher = np.array([0.05, 0.9, 0.05]) # teacher model's softer output

eps = 1e-12  # numerical stability

# Cross-entropy: how well does pred match target?
cross_entropy = -(target * np.log(pred + eps)).sum()

# KL divergence: how far is pred from teacher?
# KL(teacher || pred) = sum(teacher * log(teacher / pred))
kl = (teacher * (np.log(teacher + eps) - np.log(pred + eps))).sum()

print(f"cross_entropy = {cross_entropy:.4f}")  # 0.2231
print(f"kl_divergence = {kl:.4f}")           # 0.0144
print(f"Note: KL is small because teacher and pred are similar")

Follow-up Questions

What is the KL penalty in RLHF and why is it needed?

During RLHF, the model is optimized to maximize a reward signal. Without a constraint, it would exploit quirks in the reward model (reward hacking). The KL penalty penalizes the policy for drifting too far from the base model, keeping outputs reasonable. The coefficient on the KL term controls the trade-off: higher penalty = more conservative outputs; lower = more reward-driven.

When would you use Jensen-Shannon divergence instead of KL?

Jensen-Shannon divergence (JSD) is the symmetric, bounded version: JSD(P,Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), where M = 0.5*(P+Q). Use JSD when you need a true distance metric (symmetric, satisfies triangle inequality) or when comparing two model outputs without a clear "reference" direction. GAN training historically used JSD.

Gradient Mechanics

How gradients flow through transformers — from embedding lookups through vector-valued layers to the chain rule that ties it all together.

Gradients for Embeddings

Each token lookup selects a row from the embedding matrix. During backpropagation, gradients flow only into the rows that were used in the forward pass. The embedding table is a trainable parameter matrix with sparse updates — only tokens in the batch receive direct gradient signal.

💡 The embedding table is a dictionary where entries only get edited when they are actually looked up. Unused entries sleep through the training step.

Embedding matrix (rows = tokens). Only rows used in this batch get gradient updates:

How It Works

An embedding layer is mathematically equivalent to multiplying a one-hot vector by the embedding matrix. The one-hot vector selects exactly one row. During backpropagation, the gradient with respect to the embedding matrix is also sparse: it has non-zero entries only for the rows corresponding to tokens that appeared in the batch.

Why Sparsity Matters

With a vocabulary of 100,000 tokens and a batch of 2,048 tokens, fewer than 2% of embedding rows receive gradient updates on any given step. This has practical consequences:

Rare tokens get updated infrequently, so their embeddings are less refined
Optimizers must handle sparse gradients correctly (Adam does this naturally via per-parameter moments)
Gradient accumulation across batches helps rare tokens get enough updates

Frozen vs Trainable Embeddings

In PEFT methods, embeddings are typically frozen (no gradient updates). This preserves the base model's token representations. In full fine-tuning, embedding updates can help with domain-specific terminology but also risk destabilizing well-learned representations for common tokens.

→ Embedding gradients are sparse — only tokens present in the batch get updated. This means rare tokens learn slowly and need more data or longer training to develop good representations.

Python — Embedding Gradient Sparsity

import torch
import torch.nn as nn

# Create a small embedding table: 10 tokens, 4-dimensional embeddings
embed = nn.Embedding(10, 4)

# Forward pass: look up tokens [2, 5, 2]
token_ids = torch.tensor([2, 5, 2])
output = embed(token_ids)

# Simulate a loss and backward pass
loss = output.sum()
loss.backward()

# Only rows 2 and 5 received gradients
print("Gradient for each row:")
for i in range(10):
    grad = embed.weight.grad[i]
    has_grad = grad.abs().sum().item() > 0
    print(f"  Token {i}: {'UPDATED' if has_grad else 'skipped':}  grad={grad.tolist()}")
# Output: Only tokens 2 and 5 show non-zero gradients

Follow-up Questions

How do embeddings differ from a regular linear layer?

Mathematically, they are equivalent: Embedding(token_id) is the same as one_hot(token_id) @ W. The difference is implementation efficiency. Embedding layers use an index lookup (O(1) per token) instead of a matrix multiply (O(vocab_size * dim)), making them vastly faster for large vocabularies.

Can you freeze just the embeddings during fine-tuning?

Yes, and it is a common practice. Freezing embeddings preserves the model's existing token representations while allowing higher layers to adapt. This is especially useful when your fine-tuning data is small — updating embeddings with limited data can worsen representations for common tokens that were well-learned during pre-training.

The Jacobian Matrix

The Jacobian matrix contains all partial derivatives of a vector-valued function with respect to its inputs. In deep learning, it tells you how small changes in each input dimension affect each output dimension — essential for correctly backpropagating gradients through vector transformations.

💡 If a single derivative is "how much does the output wiggle when I wiggle the input," the Jacobian is the complete wiggle map for every input-output pair simultaneously.

Input (x)

→

Jacobian J

df1/dx1

df1/dx2

df1/dx3

df2/dx1

df2/dx2

df2/dx3

→

Output f(x)

From Scalars to Vectors

For a scalar function f(x), the derivative df/dx tells you the rate of change. For a vector-valued function f: R^n -> R^m, the Jacobian matrix J has shape m x n, where each entry J[i,j] = df_i/dx_j. This generalizes the single-variable derivative to the multivariate case.

Why It Matters in Deep Learning

Each layer in a neural network is a vector-valued function. Backpropagation needs to pass gradients backward through these functions. The chain rule for vector functions involves multiplying Jacobian matrices: dL/dx = J^T * dL/df. This is how gradient information flows from the loss back through every layer.

Practical Implications

Jacobian size — For a layer mapping R^4096 to R^4096, the full Jacobian has 16.7M entries. Storing it is impractical, which is why frameworks compute Jacobian-vector products instead.
Jacobian conditioning — If the Jacobian has very large or very small singular values, gradients explode or vanish. This is the mathematical root of training instability in deep networks.
Softmax Jacobian — The Jacobian of softmax has a specific structure that makes backpropagation through attention efficient.

→ The Jacobian generalizes derivatives to vector functions. Backpropagation is chain-multiplying Jacobians — their conditioning determines whether gradients survive the journey through deep networks.

Python — Computing the Jacobian

import torch

def compute_jacobian(f, x):
    """
    Compute the full Jacobian matrix of f at point x.
    f: function R^n -> R^m
    x: input tensor of shape (n,)
    Returns: Jacobian matrix of shape (m, n)
    """
    x = x.requires_grad_(True)
    y = f(x)
    m = y.shape[0]
    n = x.shape[0]

    jacobian = torch.zeros(m, n)
    for i in range(m):
        # Gradient of the i-th output w.r.t. all inputs
        grad = torch.autograd.grad(
            y[i], x, retain_graph=True, create_graph=True
        )[0]
        jacobian[i] = grad

    return jacobian

# Example: softmax Jacobian (3-dimensional)
x = torch.tensor([2.0, 1.0, 0.5])
J = compute_jacobian(torch.nn.functional.softmax, x)
print("Softmax Jacobian:\n", J)

Follow-up Questions

Why not compute the full Jacobian during training?

The full Jacobian has m x n entries per layer. For transformer layers with 4096-dimensional inputs and outputs, that is 16.7 million entries per layer. Instead, backpropagation computes Jacobian-vector products (JVPs) directly, which gives you the gradient of the scalar loss w.r.t. inputs without ever materializing the full matrix. This is the key insight that makes training large models practical.

How does the Jacobian relate to gradient vanishing/exploding?

When you multiply Jacobians across many layers (via the chain rule), the product's magnitude depends on the singular values of each Jacobian. If singular values are consistently < 1, the product shrinks exponentially (vanishing). If > 1, it grows exponentially (exploding). Residual connections and normalization are designed to keep these products well-conditioned.

The Chain Rule and Backpropagation

Neural networks are compositions of many functions. The chain rule computes the derivative of the whole composition by multiplying local derivatives step by step from output to input. Backpropagation is the efficient bookkeeping of that repeated chain-rule application.

💡 The chain rule is the blame game. If the final answer is wrong, each layer passes back its share of responsibility for the error, multiplied through the chain of functions that produced the result.

Forward pass (left to right) / Backward pass (right to left, in pink)

Input x

tokens

→

← dL/dh1

Layer 1

h1 = f1(x)

→

← dL/dh2

Layer 2

h2 = f2(h1)

→

← dL/dy

Loss L

L = loss(y)

dL/dx = dL/dy × dy/dh2 × dh2/dh1 × dh1/dx
Each factor is a local derivative (Jacobian) computed during the backward pass

The Core Idea

For a composition L = loss(f2(f1(x))), the chain rule gives: dL/dx = dL/df2 * df2/df1 * df1/dx. Each factor is a local derivative that depends only on the function at that layer and its current input. Backpropagation computes these factors right-to-left, reusing cached forward-pass values.

Why Backpropagation Is Efficient

A naive approach to computing gradients would re-evaluate the entire network for each parameter. Backpropagation instead makes one forward pass (caching intermediate values) and one backward pass (multiplying local derivatives). The computational cost of the backward pass is roughly 2x the forward pass, regardless of network depth.

Practical Consequences

Concept	What Happens	Engineering Impact
Activation checkpointing	Drop cached activations, recompute during backward	Trades compute for memory (critical for large models)
Gradient accumulation	Sum gradients over multiple mini-batches before updating	Simulates larger batch sizes with less GPU memory
Mixed precision	Forward in fp16/bf16, accumulate gradients in fp32	2x memory reduction, faster compute

The chain rule is also why residual connections are critical — they create additive shortcut paths through the chain, preventing gradient signal from vanishing over many multiplications.

→ Backpropagation is the chain rule applied efficiently. Without it, training deep networks would be computationally infeasible. It is why deep models are trainable systems rather than black boxes.

Python — Manual Backpropagation

import numpy as np

# Manual chain rule through a 2-layer network
# y = relu(x @ W1) @ W2
# loss = 0.5 * (y - target)^2

np.random.seed(42)
x = np.array([1.0, 2.0])               # input
W1 = np.random.randn(2, 3)               # layer 1 weights
W2 = np.random.randn(3, 1)               # layer 2 weights
target = np.array([1.0])                 # target output

# --- Forward pass (cache intermediates) ---
z1 = x @ W1                               # pre-activation
h1 = np.maximum(0, z1)                   # ReLU activation
y = h1 @ W2                               # output
loss = 0.5 * (y - target) ** 2            # MSE loss

# --- Backward pass (chain rule, right to left) ---
dL_dy = y - target                         # dL/dy
dL_dW2 = h1.reshape(-1, 1) @ dL_dy.reshape(1, -1)  # dL/dW2
dL_dh1 = dL_dy @ W2.T                     # chain: dL/dh1
dL_dz1 = dL_dh1 * (z1 > 0)               # ReLU derivative
dL_dW1 = x.reshape(-1, 1) @ dL_dz1.reshape(1, -1)  # dL/dW1

print(f"Loss: {loss.item():.4f}")
print(f"dL/dW1 shape: {dL_dW1.shape}")
print(f"dL/dW2 shape: {dL_dW2.shape}")

Follow-up Questions

What is activation checkpointing and why does it help?

During backpropagation, you need the cached activations from the forward pass to compute local derivatives. Activation checkpointing discards most cached activations and recomputes them on-the-fly during the backward pass. This trades ~33% more compute for ~60–70% less memory, enabling training of models that otherwise would not fit in GPU memory.

Why is the backward pass roughly 2x the cost of the forward pass?

The forward pass computes one matrix multiply per layer. The backward pass computes two: one to propagate the gradient to the previous layer (dL/dx = J^T * dL/dy) and one to compute the parameter gradient (dL/dW = x^T * dL/dy). Each has similar cost to the forward multiply, hence the ~2x factor.

How does gradient accumulation work?

Instead of updating weights after every mini-batch, you sum (accumulate) gradients across N mini-batches and update once. This is mathematically equivalent to using a batch size N times larger, but uses only 1/N the memory. It is essential for training large models on limited hardware where the maximum batch size is small.

Stability & Structure

The architectural and mathematical mechanisms that keep deep transformer training stable — activation functions, residual paths, normalization, and dimensionality reduction.

The ReLU Derivative and Activation Functions

ReLU passes positive inputs unchanged and zeros out negative inputs. Its derivative is 1 for positive inputs and 0 for negative, making it computationally simple and helping gradients flow better than older saturating activations like sigmoid. This simplicity was a key enabler of deep network training.

💡 ReLU is a one-way gate: positive signals pass through at full strength, negative signals are blocked. The gate's derivative is binary — either "pass" (1) or "block" (0) — which keeps gradient computation trivial.

— f(x) — f'(x) (derivative)

Why ReLU Changed Deep Learning

Before ReLU, networks used sigmoid and tanh activations. Both saturate at extreme values, producing derivatives near zero. When you multiply many near-zero derivatives via the chain rule, gradients vanish exponentially. ReLU's derivative of exactly 1 for positive inputs avoids this: gradients pass through unchanged.

Activation Function Comparison

Function	Formula	Derivative	Used In
ReLU	max(0, x)	0 or 1	Earlier transformers, CNNs
GELU	x * Phi(x)	Smooth curve	GPT, BERT, most modern LLMs
SiLU/Swish	x * sigmoid(x)	Smooth, non-monotonic	LLaMA, PaLM
Sigmoid	1/(1+exp(-x))	sigma*(1-sigma)	Legacy, gating mechanisms

Modern Transformers: GELU and SiLU

Most current LLMs use GELU (Gaussian Error Linear Unit) or SiLU/Swish instead of plain ReLU. These smooth approximations of ReLU have non-zero gradients for slightly negative inputs, which helps with optimization. The key insight remains the same: avoid saturating derivatives that kill gradient flow.

The "dead neuron" problem of ReLU (neurons stuck at zero output) is largely avoided by GELU and SiLU, which is one reason modern architectures prefer them.

→ ReLU's non-saturating derivative was the historical breakthrough. Modern LLMs use smoother variants (GELU, SiLU) that preserve the gradient-flow benefit while avoiding dead neurons.

Python — Activation Functions and Derivatives

import numpy as np

def relu(x):
    """ReLU: simple, fast, but can produce dead neurons."""
    return np.maximum(0, x)

def relu_derivative(x):
    """Binary derivative: 1 if x > 0, else 0."""
    return (x > 0).astype(float)

def gelu(x):
    """GELU: used in GPT, BERT. Smooth approximation of ReLU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

# Compare gradient flow through 50 layers
x = np.array([0.5])
relu_grad = 1.0
sigmoid_grad = 1.0

for _ in range(50):
    relu_grad *= relu_derivative(x)[0]           # stays 1.0
    sig = 1 / (1 + np.exp(-x[0]))
    sigmoid_grad *= sig * (1 - sig)               # shrinks fast

print(f"After 50 layers:")
print(f"  ReLU gradient:    {relu_grad:.6f}")      # 1.000000
print(f"  Sigmoid gradient: {sigmoid_grad:.2e}")   # ~1e-38 (vanished!)

Follow-up Questions

What is the "dead neuron" problem in ReLU?

If a neuron's input is always negative (due to unlucky initialization or large learning rates), its ReLU output is always zero and its gradient is always zero. The neuron is permanently dead and can never recover. This affects a small but non-trivial fraction of neurons in deep ReLU networks. GELU and SiLU avoid this because they have non-zero gradients for negative inputs.

Why did transformers switch from ReLU to GELU?

GELU was introduced in the BERT paper and produces slightly better training dynamics for transformers. The smooth transition around zero (instead of ReLU's sharp corner) gives more informative gradients for inputs near zero, which are common after layer normalization. The computational overhead is negligible on modern hardware.

Residual Connections and Normalization

Residual connections create short paths through which gradients can flow directly, reducing the chance that signal disappears across many layers. Normalization keeps activations in stable numerical ranges. Together, they are what make very deep transformer models practical at all.

💡 The residual connection is a highway bypass — even if the main road (the transformer layer) has traffic jams (vanishing gradients), the highway lets the gradient signal through.

Input x

LayerNorm

↓

Self-Attention

x + Attn(x)

↓

LayerNorm

↓

Feed-Forward

y + FF(y)

Output

Residual Connections: The Gradient Highway

A residual connection computes output = x + F(x) instead of output = F(x). During backpropagation, the gradient splits: dL/dx = dL/doutput * (1 + dF/dx). The +1 term is the residual path — it passes gradient through unchanged regardless of what F does. Even if dF/dx approaches zero (vanishing gradient through the layer), the gradient still flows via the identity path.

Layer Normalization

LayerNorm normalizes activations across the feature dimension to have zero mean and unit variance. This prevents activations from drifting to extreme values across layers, keeping the inputs to each layer in a well-conditioned range. Without normalization, the accumulation of residual additions can cause activations to grow unboundedly.

Pre-Norm vs Post-Norm

Variant	Order	Behavior
Post-Norm (original)	Attention → Add → Norm	Harder to train deep models without warmup
Pre-Norm (modern)	Norm → Attention → Add	More stable training, preferred in most LLMs

Why Both Are Needed Together

Residual connections alone let gradients flow but can cause activation explosion. Normalization alone stabilizes activations but does not fix gradient vanishing through complex layers. Together, they create a training regime where 96-layer transformers train as reliably as 6-layer ones — a feat that seemed impossible before these techniques.

See Topic 7: Chain Rule and Backpropagation for why gradient flow matters, and Topic 6: The Jacobian for the mathematical framework.

→ Residual connections provide gradient highways; normalization prevents activation drift. Together, they are the architectural reason why 100-layer transformers can train stably.

Python — Transformer Block with Residual + Norm

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """
    A single transformer block with pre-norm residual connections.
    This is the standard architecture in modern LLMs (GPT, LLaMA).
    """
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Pre-norm: normalize BEFORE the sublayer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),                          # modern activation
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Residual connection 1: x + Attention(Norm(x))
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out                       # gradient highway

        # Residual connection 2: x + FF(Norm(x))
        x = x + self.ff(self.norm2(x))          # gradient highway
        return x

Follow-up Questions

What is RMSNorm and why do some models prefer it?

RMSNorm (Root Mean Square Normalization) normalizes by the RMS of activations instead of subtracting the mean and dividing by the standard deviation. It is computationally simpler (no mean computation) and has been shown to work equally well in practice. LLaMA and several other modern LLMs use RMSNorm for its efficiency advantage.

Can you have too many residual connections?

In principle, each residual connection adds the input to the output, so activations grow with depth (linear accumulation). Normalization counteracts this by re-scaling after each block. The standard pattern of one residual + one norm per sub-layer is well-calibrated; adding extra skip connections can actually hurt by making the effective network shallower (the model learns to "skip" more layers).

Eigenvalues, Eigenvectors, and Dimensionality Reduction

Eigenvectors identify directions of variation in data; eigenvalues quantify how much variance each direction explains. Keeping the leading eigenvectors lets you compress data while preserving important structure — the mathematical foundation behind PCA, low-rank approximations, and why LoRA works.

💡 Eigenvalues are like a priority list for dimensions. The top eigenvalues say "this direction matters most." Dimensionality reduction is keeping the VIP dimensions and dropping the noise.

What Eigenvectors and Eigenvalues Are

For a square matrix A, an eigenvector v satisfies A * v = lambda * v. The matrix only scales the eigenvector, without changing its direction. The scaling factor lambda is the eigenvalue. In PCA, the covariance matrix's eigenvectors point along the principal axes of variation in the data, and eigenvalues tell you how spread out the data is along each axis.

Connection to LLM Practice

Application	How Eigen-Analysis Appears
LoRA	Low-rank adapters work because weight updates are approximately low-rank — most change concentrates in a few directions (large eigenvalues)
Embedding visualization	PCA projects high-dimensional token embeddings into 2D/3D for visualization
Training stability	The eigenvalues of the Hessian (curvature matrix) determine whether optimization is stable
Compression	SVD (which uses eigen-decomposition) is the mathematical backbone of model compression

Why Low Rank Works for Adaptation

When you fine-tune a model, the weight updates delta_W tend to be approximately low-rank: most of the change concentrates in a small subspace. This is why LoRA (see the chapter on fine-tuning) can capture most of the adaptation quality with rank 16 or 32 out of 4096 dimensions. The eigenvalue spectrum of typical weight updates drops off quickly — the top few directions carry most of the information.

PCA in One Paragraph

Principal Component Analysis computes the eigenvectors of the data's covariance matrix, sorts them by eigenvalue, and keeps the top k. Projecting data onto these k vectors gives a k-dimensional representation that preserves as much variance as possible. In LLM work, you rarely apply PCA inside the model, but the concept explains why lower-dimensional projections and latent spaces can retain useful structure.

→ Eigenvalues reveal which dimensions carry information and which carry noise. This principle underlies LoRA, PCA, SVD-based compression, and the intuition for why low-rank approximations work in practice.

Python — PCA and Eigenvalues

import numpy as np

# Generate some 2D data with clear principal directions
np.random.seed(42)
n = 200
data = np.random.randn(n, 2) @ np.array([[3, 1], [1, 0.5]])

# Step 1: Center the data
centered = data - data.mean(axis=0)

# Step 2: Compute covariance matrix
cov = (centered.T @ centered) / (n - 1)

# Step 3: Eigen-decomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Variance explained
total_var = eigenvalues.sum()
for i, ev in enumerate(eigenvalues):
    print(f"PC{i+1}: eigenvalue={ev:.3f}, variance explained={ev/total_var:.1%}")

# In LLM context: if eigenvalues drop off quickly,
# low-rank approximations (like LoRA) capture most of the signal

Follow-up Questions

How does SVD relate to eigenvalues?

Singular Value Decomposition (SVD) decomposes any matrix A = U * S * V^T. The singular values (diagonal of S) are the square roots of the eigenvalues of A^T * A. SVD generalizes eigen-decomposition to non-square matrices, which is why it is more directly useful for weight matrices in neural networks. LoRA's low-rank decomposition is conceptually related to truncated SVD.

Can you use PCA to compress a trained model?

Yes, but with caveats. You can apply SVD to weight matrices and keep only the top-k singular values/vectors, effectively reducing parameter count. This is a form of post-training compression. The quality impact depends on how quickly singular values decay. For attention weight matrices, the decay is often rapid, enabling significant compression with modest quality loss.

Why does LoRA assume low-rank updates will work?

Empirical evidence shows that fine-tuning weight deltas have low intrinsic dimensionality. The Aghajanyan et al. (2020) paper demonstrated that fine-tuning changes concentrate in a low-dimensional subspace. This means a rank-16 or rank-32 adapter can capture the essential adaptation despite the weight matrix being 4096-dimensional. The eigenvalue spectrum of the true delta drops off rapidly.