Ch 5: Pretraining Objectives, Model Families & Classical Comparisons

Foundations & Objectives

What language models are, how pretraining objectives differ, and the vocabulary strategies that let them handle any text.

What Is a Language Model?

A language model estimates the probability of token sequences. It is called "large" because modern versions use massive parameter counts, datasets, and compute — enabling broad statistical learning but also introducing new costs and failure modes like hallucination.

💡 Think of a language model as a probability engine: feed it a partial sequence, and it estimates what comes next (or what is missing). "Large" means the engine is powerful enough to internalize language-wide patterns, not just local statistics.

Autoregressive LM: Predict next token left-to-right
Masked LM: Recover hidden tokens from context
Seq2Seq Model: Map input sequence to output sequence
Foundation Model: Broad pretraining, then adaptation

Autoregressive LMs predict the next token given all prior tokens. They excel at free-form generation, dialogue, code completion, and any task framed as continuation. Examples: GPT series, LLaMA, Claude.

The Core Idea

At its most basic, a language model learns a probability distribution over sequences of tokens. Given a partial sequence, it can estimate which token is most likely to appear next, or which token best fills a gap in context. Every modern LLM is, at its core, this probability estimator — scaled up with deep architectures and enormous training sets.
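
As a minimal sketch of that idea, the toy example below scores a sequence with the chain rule, multiplying one conditional probability per token. The bigram table and its numbers are invented for illustration; a real LLM conditions each prediction on the whole prefix, not only on the previous token.

```python
import math

# Hypothetical bigram probabilities P(next | previous); the values are made up.
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
    ("sat", "</s>"): 0.25,
}

def sequence_log_prob(tokens, table=BIGRAM, floor=1e-9):
    """Sum log P(token_i | token_{i-1}) over the sequence (chain rule)."""
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        # Unseen pairs get a tiny floor probability instead of zero.
        total += math.log(table.get((prev, nxt), floor))
    return total

tokens = ["<s>", "the", "cat", "sat", "</s>"]
log_p = sequence_log_prob(tokens)
print(f"log P = {log_p:.4f}, P = {math.exp(log_p):.4f}")
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why training losses are usually reported as negative log-likelihood.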

Why "Large" Matters

"Large" does not only mean more parameters. It also implies:

Longer training runs over more diverse data, enabling broader generalization.
More sophisticated infrastructure — distributed training across hundreds or thousands of GPUs.
Bigger context windows that let the model consider more information at once.
New failure modes such as hallucination, distribution shift, and high deployment cost.

Scale brings capability but also cost and complexity. A strong interview answer connects both sides: why size enables broad statistical learning and why it introduces engineering challenges that did not exist with smaller models.

Model Family Landscape

The major model families differ in their pretraining objective, not just their brand name. Understanding this table is the fastest way to compare them:

Family	Objective	Best For
Autoregressive LM	Predict next token	Generation, dialogue, coding
Masked LM	Recover hidden tokens	Representation, classification, retrieval
Seq2Seq	Map input to output sequence	Translation, summarization
Foundation Model	Broad pretraining + adaptation	Multi-task reuse across products

→ A language model is a probability engine over token sequences. "Large" means it has internalized enough data to generalize broadly — but that power comes with cost, infrastructure, and failure-mode trade-offs.

Python — Exploring Token Probabilities

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small autoregressive language model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Encode a prompt and get next-token probabilities
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # logits shape: (batch, seq_len, vocab_size)
    logits = outputs.logits[:, -1, :]
    probs = torch.nn.functional.softmax(logits, dim=-1)

# Show the top 5 most likely next tokens
top5 = torch.topk(probs, 5)
for i in range(5):
    token_id = top5.indices[0][i].item()
    prob = top5.values[0][i].item()
    token_str = tokenizer.decode([token_id])
    print(f"  {token_str!r:12} p={prob:.4f}")

Follow-up Questions

What is the difference between a model and a model family?

A model family refers to a group of models that share the same pretraining objective and architectural pattern (e.g., autoregressive transformers). A specific model is a particular trained instance within that family (e.g., GPT-4, LLaMA 3). Comparing families by objective is more useful in interviews than comparing individual models by benchmark score.

Does "large" have a formal threshold?

No. There is no agreed-upon parameter count that makes a model "large." The term reflects a regime where scale itself creates emergent capabilities — behaviors that smaller models do not exhibit. In practice, models above roughly 1 billion parameters often qualify, but the boundary is fuzzy and shifts as the field advances.

Can a language model do things beyond language?

Yes. Because LLMs learn general sequence-to-sequence mappings, they can process structured data (JSON, SQL), code, mathematical notation, and even multimodal inputs when paired with vision encoders. The "language" in LLM is increasingly a misnomer — they are general sequence models that happen to be trained primarily on text.

Autoregressive vs Masked Models

Autoregressive models predict the next token from left context only, making them natural generators. Masked models predict hidden tokens from both directions, making them strong representation learners. Neither is universally better: each objective gives the model different default strengths.

💡 Autoregressive = reading a story and guessing what happens next. Masked = doing a crossword puzzle where you see the surrounding clues and fill in the blanks.

Autoregressive (GPT-style): Each token only sees what came before it. Generation flows left to right.

Masked (BERT-style): Hidden tokens are predicted using both left and right context.

Objective Shapes Behavior

The pretraining objective is the single most important design decision for a language model, because it determines the default behavior the model develops. Autoregressive objectives train the model to continue sequences, which makes them natural at generation, dialogue, and open-ended tasks. Masked objectives train the model to build rich internal representations of context, which makes them strong at classification, retrieval, and understanding tasks.

Detailed Comparison

Dimension	Autoregressive	Masked
Direction	Left-to-right only	Bidirectional
Primary strength	Generation, continuation	Representation, understanding
Inference mode	Token-by-token generation	Encode full input at once
Adaptation	Prompting, RLHF, fine-tuning	Fine-tuning with classification head
Examples	GPT, LLaMA, Claude	BERT, RoBERTa, DeBERTa
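
The "Direction" row can be pictured as an attention mask. The sketch below builds both masks as plain nested lists purely for illustration; the function names are hypothetical, and real libraries construct these masks as tensors internally.

```python
def causal_mask(n):
    """Row i may attend to columns 0..i only (left context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

for name, mask in [("causal", causal_mask(4)), ("bidirectional", bidirectional_mask(4))]:
    print(name)
    for row in mask:
        print(" ", row)
```

The lower-triangular shape of the causal mask is what prevents an autoregressive model from seeing future tokens during training.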

A common interview mistake is to rank these as better or worse in the abstract. The right approach is to connect them to task fit: if you need generation, autoregressive is the natural choice. If you need embedding-based retrieval or classification, a masked model may give better representations per compute dollar. See Topic 9: Generative vs Discriminative for the broader framing.

→ Separate generation from representation: autoregressive objectives train continuation, masked objectives train context understanding. Choose by task fit, not by brand.

Python — Autoregressive vs Masked Prediction

from transformers import pipeline

# --- Autoregressive: complete the sequence ---
gen = pipeline("text-generation", model="gpt2", max_new_tokens=10)
result = gen("The capital of France is")
print("GPT-2 continuation:", result[0]["generated_text"])

# --- Masked: fill in the blank ---
fill = pipeline("fill-mask", model="bert-base-uncased")
result = fill("The capital of France is [MASK].")
print("BERT fill-mask:")
for r in result[:3]:
    print(f"  {r['token_str']:12} score={r['score']:.4f}")

Follow-up Questions

Can autoregressive models also do classification?

Yes. Through prompting, an autoregressive model can be asked to output a label. However, it generates the label as text, which requires parsing and validation. Masked models can more naturally produce embeddings for a classification head. The trade-off is flexibility (autoregressive) versus efficiency and calibration (masked). See Topic 9: Generative vs Discriminative.

Why did autoregressive models win for general-purpose AI assistants?

Because most user-facing tasks require open-ended generation: answering questions, writing code, summarizing documents. The autoregressive objective is a natural fit for these. Masked models cannot generate fluently in the same way because they were not trained to produce sequences token by token.

Is there a model that combines both objectives?

Yes. Models like T5 use an encoder-decoder architecture where the encoder processes bidirectional context and the decoder generates autoregressively. XLNet used a permutation-based objective to capture bidirectional context within an autoregressive framework. These hybrids try to get the best of both worlds.

Masked Language Modeling

Masked language modeling (MLM) randomly hides a subset of tokens and asks the model to predict them from surrounding context. Because prediction depends on both left and right neighbors, the model learns bidirectional contextual representations — the foundation of BERT-style models.

💡 MLM is like a cloze test: the teacher removes words from a passage and the student must fill them in by understanding the whole sentence, not just the preceding words.

Random masking: ~15% of tokens are selected; each selected token is replaced with [MASK], replaced with a random token, or left unchanged.

Bidirectional encoding: The transformer sees all unmasked tokens and their positions simultaneously.

Prediction: The model outputs a probability distribution over the vocabulary for each masked position.

How MLM Works

During pretraining, roughly 15% of tokens are selected for masking. Of those, 80% are replaced with a special [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. This mix prevents the model from only learning to recognize the [MASK] symbol and forces it to build robust representations for all positions.
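
The 80/10/10 rule can be sketched in a few lines. This is a simplified illustration, not the BERT reference implementation: the function name, toy vocabulary, and seed are assumptions, and production code also skips special tokens such as [CLS] and [SEP].

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)      # not selected: left as-is, no loss here
            labels.append(None)
            continue
        labels.append(tok)             # selected: the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: special mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(tok)                # 10%: unchanged
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 200
corrupted, labels = mask_tokens(tokens, vocab)
selected = sum(label is not None for label in labels)
print(f"selected {selected} of {len(tokens)} tokens ({selected / len(tokens):.1%})")
```

The loss is computed only at positions where the label is not None, which is why each example supplies supervision for only a fraction of its tokens.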

Why Bidirectionality Matters

Because the masked token can depend on words appearing both before and after it, the model must learn to integrate context from all directions. This produces richer contextual embeddings than a left-to-right autoregressive model, which only ever looks backward. These embeddings are why BERT-style models proved so effective for:

Search ranking — understanding whether a document matches a query
Classification — sentiment analysis, toxicity detection, topic assignment
Sentence-pair tasks — natural language inference, paraphrase detection
Token-level tasks — named entity recognition, part-of-speech tagging

Compare this with the autoregressive objective described in Topic 2: Autoregressive vs Masked Models, which focuses on generation rather than representation.

Limitations

MLM does not train the model to generate text. A BERT-style model cannot produce a coherent paragraph the way GPT can, because it was never trained to predict sequences token by token. It excels at encoding input for downstream tasks, not at producing new output.

→ MLM teaches contextual understanding, not generation. That is why BERT-family models dominate search, classification, and embedding tasks where rich bidirectional representations matter most.

Python — Masked Language Modeling with BERT

from transformers import pipeline

# Create a fill-mask pipeline using BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts what word best fills the [MASK] position
# using BOTH left and right context (bidirectional)
sentence = "The [MASK] sat on the mat and purred loudly."

predictions = fill_mask(sentence)
print(f"Input: {sentence}")
print("Top predictions:")
for p in predictions[:5]:
    # Each prediction uses context from "The" AND "sat on the mat..."
    print(f"  {p['token_str']:12} confidence={p['score']:.4f}")

Follow-up Questions

Why mask only 15% of tokens instead of more?

The 15% rate balances two concerns. Masking too many tokens makes prediction too hard and deprives the model of context it needs to learn. Masking too few slows training because each example provides less supervision signal. The original BERT paper (Devlin et al., 2019) adopted 15%, and it became the conventional default; it is best read as a practical trade-off rather than a proven optimum.

What improvements did RoBERTa make over BERT's MLM?

RoBERTa (Liu et al., 2019) kept the MLM objective but made several training improvements: it removed the next sentence prediction task (see Topic 4: Next Sentence Prediction), used dynamic masking (a fresh mask each time a sequence is seen), trained on more data with larger batches, and trained for longer. These changes produced significantly better downstream results without changing the core objective.

Can you use MLM for data augmentation?

Yes. A technique called masked augmentation uses MLM to generate plausible substitutions for tokens in training data. This can increase training set diversity for downstream classifiers. However, the quality depends on the masked model's domain knowledge — a general-purpose BERT may produce poor substitutions for highly specialized domains.

Next Sentence Prediction

Next sentence prediction (NSP) is a pretraining task where the model decides whether one sentence naturally follows another. It helped early BERT learn coarse discourse relationships, but later work showed it is not always necessary — making it more of a historical milestone than a universal recipe.

💡 NSP is like a reading comprehension check: "Does sentence B logically follow sentence A?" It teaches paragraph-level coherence, not just word-level prediction.

Sentence A: The cat sat on the warm windowsill.
Sentence B: It stretched lazily in the afternoon sun.

The Original Design

In the original BERT paper (Devlin et al., 2019), NSP was paired with masked language modeling (see Topic 3: Masked Language Modeling) as a joint pretraining objective. The model received two segments: 50% of the time, sentence B actually followed sentence A in the corpus (IsNext), and 50% of the time, sentence B was a random sentence from the corpus (NotNext). The model had to classify which case it was seeing.
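
A rough sketch of how such training pairs could be assembled is shown below. The function name and the tiny corpus are hypothetical, and the negative sampling is simplified: it draws from the same short list of sentences, whereas a real pipeline would sample from a much larger corpus. The label convention (0 = IsNext, 1 = NotNext) is chosen to match the Hugging Face NSP head.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples; 0 = IsNext, 1 = NotNext."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            pairs.append((sentences[i], sentences[i + 1], 0))
        else:
            # Negative example: any sentence except A itself and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 1))
    return pairs

corpus = [
    "The cat sat on the warm windowsill.",
    "It stretched lazily in the afternoon sun.",
    "Later it jumped down to look for food.",
    "The kitchen was quiet.",
]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)
```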

Why It Mattered

NSP was designed to teach cross-sentence reasoning, which is essential for tasks like:

Natural language inference (NLI) — Does premise entail hypothesis?
Question answering — Does this passage contain the answer?
Paraphrase detection — Do these two sentences mean the same thing?

Why It Fell Out of Favor

Later models like RoBERTa, ALBERT, and SpanBERT found that removing NSP or replacing it with alternative objectives (like sentence order prediction) produced equal or better results. The consensus is that MLM alone, when combined with enough data and training time, captures sufficient discourse-level signal. NSP remains historically important as an illustration of how pretraining objectives shape downstream behavior.

→ NSP taught BERT paragraph-level coherence, but later research showed MLM alone can capture similar signal. It is a historical milestone that illustrates how objective design shapes model capabilities.

Python — NSP with BERT

from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

# Load BERT with its NSP head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Real pair: sentence B follows sentence A
sent_a = "The cat sat on the warm windowsill."
sent_b = "It stretched lazily in the afternoon sun."

# Tokenize as a sentence pair ([CLS] A [SEP] B [SEP])
inputs = tokenizer(sent_a, sent_b, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # logits has shape (batch, 2): column 0 = IsNext score, column 1 = NotNext score
    prediction = torch.argmax(outputs.logits, dim=1).item()

print(f"Prediction: {'IsNext' if prediction == 0 else 'NotNext'}")

Follow-up Questions

What replaced NSP in later models?

RoBERTa simply dropped NSP entirely. ALBERT replaced it with sentence-order prediction (SOP), where the model must determine if two consecutive sentences appear in the correct order or are swapped. SpanBERT dropped NSP, trained on single contiguous segments, and added a span boundary objective (SBO) alongside span masking. All reported comparable or better downstream performance.

Is NSP still relevant for modern LLMs?

Not directly. Modern autoregressive LLMs like GPT-4 and Claude do not use NSP at all — their next-token prediction objective naturally learns discourse coherence from long-context training data. NSP is primarily relevant when discussing the historical evolution of pretraining objectives and BERT-family design choices.

Could NSP help with retrieval or reranking tasks?

In principle, an NSP-like objective teaches the model about passage coherence, which is related to relevance. However, modern retrieval models typically use contrastive learning objectives instead, which directly optimize for query-document similarity. NSP is too coarse a signal for high-quality retrieval.

Subword OOV Handling

Modern language models avoid the out-of-vocabulary problem by using subword tokenization. Instead of requiring every full word to exist in the vocabulary, they break unfamiliar words into smaller known pieces — so any text can be processed, even if the specific word has never been seen before.

💡 Subword tokenization is like a modular alphabet: even if you have never seen the word "unbelievable," you can still read it because you know "un," "believ," and "able."

The Old Problem

Earlier NLP systems used fixed word-level vocabularies. Any word not in the vocabulary was replaced with a special [UNK] token, destroying information. This was especially problematic for proper nouns, technical terms, typos, and morphologically rich languages.

The Subword Solution

Subword tokenization algorithms like BPE (Byte Pair Encoding), WordPiece, and the unigram language model build vocabularies of frequent character sequences; SentencePiece is a widely used library that implements BPE and unigram directly on raw text. Common words become single tokens; rare words decompose into recognizable pieces. This means:

No hard OOV — Every string can be decomposed, down to individual bytes if necessary.
Morphological sharing — "running," "runner," and "runs" share the stem "run," helping the model generalize.
Cross-lingual reuse — Shared subwords across languages enable multilingual models.
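
To make the mechanism concrete, here is a toy BPE learner in plain Python. It is a teaching sketch under simplifying assumptions (a five-word corpus, no end-of-word marker, ties broken by first occurrence), not the algorithm as shipped in any production tokenizer, and all function names are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair; return the merge list."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def segment(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(["running", "runner", "runs", "run", "sunning"], num_merges=6)
print("merges:", merges)
print("rerunning ->", segment("rerunning", merges))
```

Note that merges follow frequency, not linguistics: the learned pieces often resemble stems and suffixes, but nothing guarantees they align with true morphemes.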

The Practical Lesson

OOV handling moved from dictionary design to tokenization design. The model may not know the semantics of a brand-new term, but it can ingest and process the text because the tokenizer decomposes it into known fragments. The model then relies on context and subword patterns to infer meaning.

Strategy	Era	OOV Handling
Word-level vocab	Pre-2016	Replace unknown words with [UNK]
Character-level	~2015	No OOV, but very long sequences
Subword (BPE/WordPiece)	2016+	Decompose into known subword units
Byte-level BPE	2019+	Decompose to raw bytes as fallback

→ Subword tokenization eliminated the hard OOV problem. Modern models can process any text by decomposing it into known fragments — the tokenizer handles representation, and the model infers meaning from context.

Python — Subword Decomposition

import tiktoken

# Load GPT-4's tokenizer
enc = tiktoken.get_encoding("cl100k_base")

# Show how unfamiliar words are decomposed into subwords
test_words = [
    "electroencephalography",
    "transformerization",
    "ChatGPT",
    "COVID-19",
    "running",
]

for word in test_words:
    tokens = enc.encode(word)
    # Decode each token individually to see the subword pieces
    pieces = [enc.decode([t]) for t in tokens]
    print(f"  {word:28} -> {pieces} ({len(tokens)} tokens)")

Follow-up Questions

Does the model actually understand a word it has never seen?

Not necessarily. The tokenizer can ingest the word by splitting it into known pieces, but the model's understanding depends on whether the subword fragments carry useful semantic signal and whether the surrounding context provides enough clues. A completely novel abbreviation with no morphological hints may be processed but poorly understood.

How does subword tokenization affect non-English languages?

Languages underrepresented in the training data tend to fragment more heavily, producing higher fertility (more tokens per word). This increases both cost and latency for those languages. Multilingual tokenizers try to balance vocabulary allocation across languages, but English-centric models still have a bias toward efficient English tokenization.

What is byte-level BPE and how does it relate to OOV handling?

Byte-level BPE operates on raw UTF-8 bytes rather than Unicode characters. This guarantees that any byte sequence can be tokenized, providing a true zero-OOV guarantee. GPT-2 and later OpenAI models use byte-level BPE. The downside is that text poorly covered by the learned merges (rare scripts, emoji, unusual symbols) can split into several byte-level tokens per character, which lengthens sequences, but the universality benefit is significant.
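
The zero-OOV guarantee follows directly from UTF-8: every string maps to a sequence of values in the range 0 to 255. The sketch below shows only that byte-level fallback, with no merges applied; the helper name is illustrative.

```python
def byte_fallback(text):
    """Represent text as its UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))

# Escapes are used so the sample characters are unambiguous:
# e-acute, the CJK character for "cat", and a smiling-face emoji.
for sample in ["cat", "\u00e9", "\u732b", "\U0001F642"]:
    ids = byte_fallback(sample)
    assert bytes(ids).decode("utf-8") == sample  # lossless round trip
    print(f"{sample!r}: {len(sample)} character(s) -> {len(ids)} byte(s)")
```

A base vocabulary of 256 byte values is therefore enough to cover any input; learned merges then shorten the sequences for frequently seen text.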

Architectures & Comparisons

From sequence-to-sequence framing to the transformer revolution, and how to compare model paradigms in interviews.

Sequence-to-Sequence Models

A sequence-to-sequence (Seq2Seq) model maps one sequence into another, often with different lengths and surface forms. It is a task framing, not a single architecture — older Seq2Seq used RNNs with attention, modern versions use encoder-decoder transformers.

💡 Seq2Seq is like a translator at a conference: listen to the full input in one language (encoder), then produce the output in another language (decoder), word by word.

Encoder (Input): The cat is on the mat.

Decoder (Output): Le chat est sur le tapis.

Core Architecture Pattern

Seq2Seq models have two main components: an encoder that reads the input sequence and produces a representation, and a decoder that generates the output sequence token by token, conditioned on the encoder's representation. The encoder and decoder can be any sequence model — RNNs, LSTMs, or transformers.

Classic Use Cases

Seq2Seq shines when input and output have clearly distinct roles:

Machine translation — English to French, source to target
Summarization — Long document to short abstract
Text normalization — Messy input to clean output
Code generation — Natural language description to code

T5 and the Text-to-Text Framing

The T5 model (Raffel et al., 2020) pushed Seq2Seq to its logical extreme by framing every NLP task as a text-to-text problem. Classification becomes "classify: [input]" → "positive." Summarization becomes "summarize: [input]" → "[summary]." This unification simplified multi-task training and showed that the Seq2Seq framing is more general than it first appears. Compare this with the Topic 8: Foundation vs Task-Specific discussion.
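
The text-to-text framing amounts to rendering every example as a pair of strings. The helper below is a hypothetical illustration that reuses the prefixes from the paragraph above; the prefixes in the released T5 checkpoints are task-specific and may differ.

```python
def to_text_to_text(task, text, target):
    """Render one supervised example as an (input_text, target_text) pair."""
    return f"{task}: {text}", str(target)

examples = [
    to_text_to_text("classify", "This product exceeded all my expectations!", "positive"),
    to_text_to_text(
        "summarize",
        "Transformers replaced recurrent models by using self-attention.",
        "Transformers replaced RNNs.",
    ),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Because both the label and the summary are plain strings, one model and one loss function can serve every task.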

→ Seq2Seq is a task framing (input sequence to output sequence), not a single architecture. It is most natural when input and output have clearly different roles, lengths, or surface forms.

Python — T5 Seq2Seq Summarization

from transformers import pipeline

# T5 treats every task as text-to-text (seq2seq)
summarizer = pipeline("summarization", model="t5-small")

article = """
Transformers replaced recurrent models by using self-attention
to process all tokens in parallel. This made training faster
and enabled models to capture long-range dependencies more
effectively. The architecture has become the standard for
both natural language processing and computer vision tasks.
"""

# The encoder reads the full article,
# the decoder generates the summary token by token
result = summarizer(article, max_length=50, min_length=10)
print("Summary:", result[0]["summary_text"])

Follow-up Questions

How does a Seq2Seq encoder-decoder differ from a decoder-only model?

An encoder-decoder has two distinct components: the encoder processes the full input bidirectionally, then the decoder generates output autoregressively while attending to the encoder's representations. A decoder-only model (like GPT) does everything in one pass — the "input" is simply prepended to the sequence and both input processing and output generation share the same left-to-right architecture.

Are modern LLM chat systems actually Seq2Seq?

Functionally yes, architecturally no. Chat models like GPT-4 and Claude are decoder-only autoregressive models, but they behave like Seq2Seq systems: user message goes in, assistant response comes out. There is no separate encoder: the prompt is processed by the same decoder stack, under the same left-to-right attention, that then produces the output.

When would you choose an encoder-decoder over a decoder-only model today?

Encoder-decoders can be more parameter-efficient for tasks with clear input/output asymmetry, like translation or structured extraction, because the encoder can build a rich bidirectional representation of the input. For general-purpose generation and open-ended tasks, decoder-only models have become dominant due to simpler scaling and training dynamics.

Transformers vs RNNs

Transformers replaced RNN-based Seq2Seq models because self-attention handles long-range dependencies better and enables massive parallelization during training. RNNs process tokens sequentially, which creates a bottleneck for both speed and signal propagation over long sequences.

💡 An RNN is like reading a book one word at a time and trying to remember everything. A transformer is like spreading the entire book on a table and looking at any passage whenever you need it.

Key milestones in the transition from recurrent to attention-based architectures:

2014: RNN + Attention. RNN Seq2Seq models with attention (Bahdanau et al.) enabled machine translation breakthroughs but were limited by sequential processing.
2017: Transformer Architecture
2018: BERT & GPT
2020: GPT-3 & Scaling Laws
2023+: SSMs & Hybrids

Why RNNs Struggled at Scale

Recurrent neural networks process tokens sequentially: the hidden state at position t depends on the hidden state at position t-1. This creates two problems:

Training bottleneck — Sequential processing prevents parallelization, making training on large datasets extremely slow.
Vanishing gradients — Information from early tokens fades as it propagates through many time steps, making it hard to capture long-range dependencies even with LSTM/GRU gates.
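
A back-of-the-envelope calculation shows why the gradient fades. The sketch treats the per-step gradient factor as a single scalar, which is a deliberate simplification; in a real RNN it is a Jacobian matrix that varies from step to step.

```python
def gradient_scale(per_step_factor, num_steps):
    """Gradient magnitude after being multiplied by the same factor at every step."""
    return per_step_factor ** num_steps

for steps in (10, 100, 1000):
    shrink = gradient_scale(0.9, steps)
    grow = gradient_scale(1.1, steps)
    print(f"{steps:>5} steps: factor 0.9 -> {shrink:.3e}, factor 1.1 -> {grow:.3e}")
```

Factors slightly below 1 drive the signal toward zero (vanishing), while factors slightly above 1 blow it up (exploding); gating in LSTMs and GRUs mitigates this but does not remove the sequential path.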

How Transformers Fixed It

The transformer architecture (Vaswani et al., 2017) replaced recurrence with self-attention, allowing every token to attend to every other token in a single layer. This provides:

Full parallelism — All positions are computed simultaneously during training.
Direct signal paths — Token 1 can directly attend to token 1000 without passing through 999 intermediate states.
Scalability — Parallel training enabled models to grow from millions to hundreds of billions of parameters.
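
The direct signal path is visible in a minimal scaled dot-product attention written in plain Python. This sketch assumes the inputs are used directly as queries, keys, and values; a real transformer layer applies learned projections, multiple heads, and batched tensor operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: every position is compared with every other.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of the weight matrix is computed independently of the others, which is what lets hardware evaluate all positions at once.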

The Cost of Attention

Self-attention has quadratic complexity in sequence length (O(n²)), which becomes expensive for very long contexts. This has driven research into efficient attention variants and state-space models, but standard transformers remain dominant for most production LLMs. See Topic 10: LLMs vs Classical Models for the broader historical arc.

→ Transformers displaced RNNs by enabling parallel training and direct long-range attention. This scalability unlocked the foundation model era — without it, billion-parameter models would have been impractical.

Python — Attention vs Recurrence Complexity

def compare_complexity(seq_lengths):
    """Compare how an RNN layer and a self-attention layer scale with sequence length n.

    Hidden-size factors are ignored; only the dependence on n is shown.
    """
    header = (f"{'Seq Len':>10} {'RNN ops O(n)':>14} {'Attn ops O(n^2)':>16} "
              f"{'Ratio':>9} {'RNN steps':>10} {'Attn steps':>11}")
    print(header)
    print("-" * len(header))
    for n in seq_lengths:
        rnn_ops = n        # One recurrent update per token
        attn_ops = n * n   # Every token is compared with every token
        rnn_steps = n      # Updates must run one after another
        attn_steps = 1     # All positions computed in parallel within a layer
        print(f"{n:>10,} {rnn_ops:>14,} {attn_ops:>16,} "
              f"{attn_ops / rnn_ops:>8.1f}x {rnn_steps:>10,} {attn_steps:>11,}")

compare_complexity([128, 512, 2048, 8192, 32768])

Follow-up Questions

Are RNNs completely dead in modern NLP?

Not entirely. RNN variants appear in some edge deployment scenarios where model size must be very small, and new architectures like Mamba (a state-space model) borrow ideas from recurrence for linear-complexity sequence processing. However, for mainstream LLM development, transformers dominate overwhelmingly.

What are state-space models and could they replace transformers?

State-space models (SSMs) like Mamba and S4 process sequences with linear complexity by maintaining a compressed state rather than computing full attention. They show promising results on long-context tasks. However, transformers have a massive ecosystem advantage (tooling, optimization, proven scaling), so replacement is unlikely in the near term. Hybrids that combine attention with SSM layers are more probable.

How does positional encoding substitute for the implicit ordering RNNs provide?

RNNs encode position implicitly through their sequential processing order. Transformers, which process all positions in parallel, need explicit positional information: sinusoidal or learned encodings added to the token embeddings, or rotary embeddings applied to the queries and keys inside attention. Without any positional signal, a bidirectional transformer would treat the input as a bag of tokens.
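
As one concrete case, the sinusoidal scheme from Vaswani et al. (2017) can be sketched as follows. The function name is illustrative, and the code computes the encoding for a single position only; it assumes an even model dimension.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding for one position (d_model must be even)."""
    enc = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

for pos in range(3):
    print(pos, [round(v, 4) for v in sinusoidal_encoding(pos, 4)])
```

Because each dimension pair oscillates at a different frequency, every position gets a distinct vector that is added to the token embedding before the first layer.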

Foundation vs Task-Specific Models

A foundation model is pretrained broadly on large, diverse corpora so it can later be adapted to many tasks. A task-specific model is trained or fine-tuned for a narrow job. The trade-off is breadth versus specialization: foundation models shift effort from repeated training toward adaptation, prompting, or lightweight tuning.

💡 A foundation model is like a Swiss Army knife — versatile and ready for many jobs. A task-specific model is like a surgeon's scalpel — highly specialized and more precise for one task.

Dimension	Foundation Model	Task-Specific Model
Training data	Broad, diverse	Domain-focused
Task coverage	Many tasks	Single task
Adaptation	Prompting, LoRA, RLHF	Full fine-tuning
Deployment cost	High	Low
Update cycle	Provider-driven	Team-driven

The Foundation Model Paradigm

Foundation models (Bommasani et al., 2021) represent a shift in how ML teams build products. Instead of training a separate model for each task, teams start from a broadly pretrained base and adapt it. This provides:

Amortized training cost — One expensive pretraining run supports many downstream applications.
Fast iteration — Prompting or light fine-tuning is cheaper and faster than training from scratch.
Emergent capabilities — Foundation models exhibit behaviors not explicitly trained for, such as few-shot learning and chain-of-thought reasoning.

When Task-Specific Still Wins

Foundation models are not always the right choice:

Factor	Foundation Model	Task-Specific Model
Label set stability	Handles evolving labels well	Better for fixed taxonomies
Inference cost	Higher (large model, API fees)	Lower (small, self-hosted)
Latency	Higher	Lower
Control & safety	Harder to control precisely	Easier to audit and constrain
Data requirements	Minimal (few-shot or zero-shot)	Needs labeled training data

The best interview answer connects both: one base model can support many products, but that breadth also creates challenges in control, safety, and cost. See Topic 9: Generative vs Discriminative for the related modeling trade-off.

→ Foundation models shift ML effort from repeated task-by-task training toward adaptation. They are powerful because one model supports many products, but that breadth creates real trade-offs in cost, control, and safety.

Python — Foundation Model Adaptation Strategies

# Three ways to adapt a foundation model for a specific task

# 1. Zero-shot prompting (no training data needed)
prompt_zeroshot = """Classify the following review as positive or negative.
Review: "This product exceeded all my expectations!"
Classification:"""

# 2. Few-shot prompting (a few examples in-context)
prompt_fewshot = """Classify reviews as positive or negative.

Review: "Absolutely love it!"
Classification: positive

Review: "Terrible quality."
Classification: negative

Review: "This product exceeded all my expectations!"
Classification:"""

# 3. LoRA fine-tuning (lightweight parameter adaptation)
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                  # Low-rank dimension
    lora_alpha=32,          # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
)
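# base_model is assumed to be an already loaded transformers model
# peft_model = get_peft_model(base_model, config)
# peft_model.print_trainable_parameters()  # reports the trainable share
# Trains only a small fraction of parameters (often well under 1%, depending
# on the model and r) and keeps the base model frozen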

Follow-up Questions

What are the main risks of depending on a third-party foundation model?

Key risks include API deprecation (the provider retires the model version), pricing changes, behavior drift between model updates, data privacy concerns (sending sensitive data to an external API), and vendor lock-in. Production teams mitigate these with abstraction layers, fallback models, and regular evaluation against held-out test sets.
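
One way to picture the "abstraction layer plus fallback model" mitigation is a thin wrapper that tries providers in order. This is a minimal sketch, not a production pattern: `call_with_fallback`, `hosted_model`, and `local_model` are hypothetical names, and the two providers are stand-in functions rather than real API clients.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: any function that maps a prompt to a completion string.
ModelFn = Callable[[str], str]

def call_with_fallback(prompt: str,
                       providers: Sequence[Tuple[str, ModelFn]]) -> Tuple[str, str]:
    """Try each provider in order; return (name, output) from the first that succeeds."""
    errors: List[str] = []
    for name, fn in providers:
        try:
            return name, fn(prompt)
        except Exception as exc:  # real code would catch provider-specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stand-in providers for illustration only (no real API is called).
def hosted_model(prompt: str) -> str:
    raise TimeoutError("simulated outage")

def local_model(prompt: str) -> str:
    return "positive"

used, output = call_with_fallback(
    "Classify: 'Great product!'",
    [("hosted", hosted_model), ("local", local_model)],
)
print(used, output)
```

Because application code only depends on the wrapper, swapping or reordering providers does not touch the call sites, which is what limits lock-in.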

Is fine-tuning a foundation model the same as building a task-specific model?

Partially. Fine-tuning starts from the foundation model's broad knowledge and specializes it, which is faster and requires less data than training from scratch. But a heavily fine-tuned model may lose some of its general capabilities (catastrophic forgetting). Lightweight techniques like LoRA preserve more generality than full fine-tuning.
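
To make "lightweight" concrete, the arithmetic below estimates how many parameters a LoRA adapter trains. It is a rough sketch under stated assumptions: the model shape (hidden size 4096, 32 layers, about 7 billion parameters, square projection matrices) is illustrative rather than taken from any specific model, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank matrices per adapted weight: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumed, illustrative model shape (roughly the scale of a 7B-parameter decoder).
hidden_size = 4096
num_layers = 32
base_params = 7_000_000_000
rank = 16
adapted_per_layer = 2  # e.g. the query and value projections

per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
total = per_matrix * adapted_per_layer * num_layers
print(f"trainable LoRA params: {total:,}")
print(f"fraction of base model: {total / base_params:.4%}")
```

Under these assumptions the adapter is on the order of a tenth of a percent of the base model, which is why the frozen base weights keep most of their general behavior.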

How do you decide between prompting and fine-tuning?

Start with prompting for rapid prototyping and small-scale use. Move to fine-tuning when you need consistent behavior at scale, lower latency, or tighter cost control. The decision matrix: if your labeled data is small and labels change often, prompt. If data is large, labels are stable, and volume is high, fine-tune.
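
The decision rule above can be written down as a small function. This is only a sketch of the heuristic: `choose_adaptation` is a hypothetical helper, and the `min_examples` threshold is an assumed placeholder, not a recommended value. Mixed cases default to prompting, following the "start with prompting" advice.

```python
def choose_adaptation(num_labeled: int, labels_stable: bool, high_volume: bool,
                      min_examples: int = 1000) -> str:
    """Heuristic from the text; min_examples is a hypothetical threshold."""
    if num_labeled >= min_examples and labels_stable and high_volume:
        return "fine-tune"
    return "prompt"

small_and_shifting = choose_adaptation(50, labels_stable=False, high_volume=False)
large_and_stable = choose_adaptation(20_000, labels_stable=True, high_volume=True)
print(small_and_shifting, large_and_stable)
```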

Generative vs Discriminative Models

Generative models learn to model how data is produced, enabling them to create new samples like text continuations. Discriminative models focus on mapping inputs to labels. The line is not absolute — powerful generative models can perform discriminative tasks through prompting, but dedicated discriminative models are often more efficient for narrow prediction tasks.

💡 A generative model is like a novelist who can write new stories. A discriminative model is like a literary critic who can tell you whether a story is good or bad. The novelist can also critique, but the dedicated critic is faster at it.

Figure: Where different models fall on the generative-discriminative spectrum. From purely discriminative to purely generative: Logistic Regression, BERT + Head, GPT / Claude.

The Core Distinction

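Generative models learn P(X) or P(X|context), the probability distribution over the data itself. This lets them sample new data points (generate text, images, etc.). In the classical classification setting, a generative classifier models the joint P(X, Y) and derives P(Y|X) from it with Bayes' rule. Discriminative models learn P(Y|X) directly, the probability of a label given an input. They focus on decision boundaries rather than data generation.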
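
A minimal sketch of the generative route to classification, assuming a toy word-count model with add-one smoothing: it estimates P(Y) and P(word | Y) from data, forms the joint P(X, Y), and applies Bayes' rule to recover P(Y | X). The four training examples are made up for illustration. A discriminative model such as logistic regression would instead fit P(Y | X) directly and never model how the words are produced.

```python
from collections import Counter
import math

# Toy labeled data (made up for illustration).
data = [
    ("great fantastic love", "pos"),
    ("love great", "pos"),
    ("terrible awful", "neg"),
    ("awful boring terrible", "neg"),
]

# Generative modeling: estimate P(Y) and P(word | Y) from counts.
label_counts = Counter(label for _, label in data)
word_counts = {label: Counter() for label in label_counts}
for text, label in data:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_joint(text: str, label: str) -> float:
    """log P(X, Y) = log P(Y) + sum of log P(word | Y), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    logp = math.log(label_counts[label] / len(data))
    for w in text.split():
        logp += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
    return logp

def posterior(text: str) -> dict:
    """Bayes' rule: P(Y | X) = P(X, Y) / sum over Y' of P(X, Y')."""
    joints = {label: math.exp(log_joint(text, label)) for label in label_counts}
    z = sum(joints.values())
    return {label: p / z for label, p in joints.items()}

print(posterior("great love"))
```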

Practical Comparison

Aspect	Generative	Discriminative
Flexibility	Can handle many tasks via prompting	Specialized to one task
Efficiency	Higher cost per prediction	Lower cost, faster inference
Calibration	Harder to calibrate confidence	Easier to calibrate
Data needs	Can work zero/few-shot	Needs labeled training data
Output control	May produce unexpected formats	Structured by design

The Blurring Line

Modern generative LLMs can perform discriminative tasks (classification, scoring, decision-making) through prompting, which has blurred the traditional boundary. A prompted GPT-4 can classify sentiment, but for the same task a fine-tuned BERT classifier is typically faster and cheaper, and its confidence scores are usually easier to calibrate. The right choice depends on whether the product needs open-ended generation or tightly controlled prediction. See Topic 8: Foundation vs Task-Specific for the related deployment trade-off.

→ Generative models are more flexible; discriminative models are more efficient for narrow tasks. The best choice depends on whether you need open-ended generation or tightly controlled, cost-effective prediction.

Python — Generative vs Discriminative Classification

from transformers import pipeline

# --- Discriminative approach: dedicated sentiment classifier ---
# Fast and cheap; returns a label with a confidence score.
# With no model argument, the pipeline downloads a default sentiment model.
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was absolutely fantastic!")
print("Discriminative:", result)
# Example output (exact score varies by model version):
# [{'label': 'POSITIVE', 'score': 0.9998}]

# --- Generative approach: prompt an LLM to classify ---
# Flexible, but slower and pricier per prediction.
# gpt2 is a small base model used here only to keep the example light;
# it is not instruction-tuned, so its continuation may not be a clean label.
generator = pipeline("text-generation", model="gpt2")
prompt = """Classify as positive or negative:
"This movie was absolutely fantastic!"
Sentiment:"""
result = generator(prompt, max_new_tokens=5)
print("Generative:", result[0]["generated_text"])

Follow-up Questions

When would you use a generative model for classification instead of a discriminative one?

When the label taxonomy changes frequently, when you need explanations alongside labels, when labeled training data is scarce, or when you are in an early exploration phase and want to iterate quickly. Once the task stabilizes, switching to a discriminative model often saves significant cost.

Can you use both in a single pipeline?

Yes. A common hybrid pattern uses a fast discriminative model for high-confidence cases and falls back to a generative model for ambiguous or novel inputs. This balances cost efficiency with flexibility. The discriminative model handles the bulk of traffic, and the generative model handles the long tail.
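
The hybrid pattern can be sketched as a confidence-threshold router. Everything here is illustrative: `hybrid_classify`, `toy_classifier`, and `toy_llm` are hypothetical names, the two models are hard-coded stand-ins, and the 0.9 threshold is an assumed value that would normally be tuned on held-out data.

```python
from typing import Callable, Tuple

def hybrid_classify(text: str,
                    fast_model: Callable[[str], Tuple[str, float]],
                    fallback_model: Callable[[str], str],
                    threshold: float = 0.9) -> Tuple[str, str]:
    """Return (label, route). threshold is an assumed value to tune on held-out data."""
    label, confidence = fast_model(text)
    if confidence >= threshold:
        return label, "discriminative"
    return fallback_model(text), "generative"

# Stand-ins for real models (hypothetical; no model is loaded here).
def toy_classifier(text: str) -> Tuple[str, float]:
    if "fantastic" in text:
        return "positive", 0.98
    return "negative", 0.55

def toy_llm(text: str) -> str:
    return "mixed"

easy = hybrid_classify("fantastic movie", toy_classifier, toy_llm)
hard = hybrid_classify("good acting, weak plot", toy_classifier, toy_llm)
print(easy, hard)
```

The threshold controls the cost split: raising it sends more traffic to the slower generative fallback, lowering it trusts the cheap classifier more often.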

What does "calibration" mean in this context?

Calibration means that when the model says it is 90% confident, it should be correct about 90% of the time. Discriminative models trained with proper loss functions tend to be better calibrated. Generative models often produce overconfident or poorly calibrated verbal confidence statements, requiring external calibration techniques.
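
One common way to quantify calibration is expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy in each bin. The sketch below uses equal-width bins and made-up predictions; the bin count of 10 is a conventional but arbitrary choice.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece

# Made-up predictions: the model claims 90% confidence on ten examples.
confs = [0.9] * 10
well_calibrated = [True] * 9 + [False]      # 9 of 10 correct
overconfident = [True] * 6 + [False] * 4    # only 6 of 10 correct

ece_good = expected_calibration_error(confs, well_calibrated)
ece_bad = expected_calibration_error(confs, overconfident)
print(f"well calibrated: {ece_good:.3f}, overconfident: {ece_bad:.3f}")
```

The first case has an error close to zero because claimed confidence matches observed accuracy; the second has an error of about 0.3, the gap between 90% claimed and 60% achieved.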

LLMs vs Classical Statistical Models

Classical statistical language models like n-grams estimate probabilities from local token counts and short fixed histories. LLMs use deep architectures with distributed representations to capture long-range context, enabling them to generalize beyond memorized counts and transfer across tasks — but at vastly higher cost.

💡 An n-gram model is like a lookup table that checks the last few words. An LLM is like a reader who has absorbed millions of books and can reason about meaning, context, and intent.

Classical (n-gram)

✓ Fast, interpretable
✓ Low resource requirements
✗ Fixed short context window
✗ No transfer learning
✗ Struggles with rare patterns

Modern (LLM)

✓ Rich contextual understanding
✓ Cross-task transfer
✓ Handles long dependencies
✗ Expensive to train & deploy
✗ Less interpretable

How Classical Models Work

An n-gram model estimates the probability of a word based on the previous n-1 words. A trigram model looks at two previous words: P("cat" | "the", "fat"). These probabilities come from counting occurrences in a corpus and applying smoothing techniques to handle unseen combinations.

The fundamental limitation is fixed context. A 5-gram model cannot consider anything beyond the last 4 words, no matter how relevant earlier context might be.

What LLMs Changed

Neural language models, and later LLMs, brought three shifts relative to count-based models:

Distributed representations — Instead of discrete counts, words become dense vectors that capture semantic similarity. "King" and "queen" are nearby in embedding space, unlike in a count-based model.
Deep architectures — Multiple layers of computation allow the model to build hierarchical representations, capturing syntax, semantics, and even reasoning patterns.
Transfer learning — A model pretrained on general text can be adapted to new tasks without starting from scratch, which was impossible with n-gram models.
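
The idea that related words sit close together in embedding space can be illustrated with cosine similarity. The vectors below are hypothetical, hand-picked 3-dimensional values chosen only to show the computation; real models learn vectors with hundreds or thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings, hand-picked for illustration (not from a trained model).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "mat":   [0.1, 0.0, 0.9],
}

sim_related = cosine(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine(embeddings["king"], embeddings["mat"])
print(f"king~queen: {sim_related:.3f}")
print(f"king~mat:   {sim_unrelated:.3f}")
```

A count-based model has no comparable notion of similarity: "king" and "queen" are simply two different table keys.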

When Classical Models Still Win

Classical models remain useful in specific scenarios:

Interpretability — You can inspect exact n-gram counts and probabilities.
Low-resource deployment — A small n-gram model can fit in kilobytes to megabytes; an LLM typically needs gigabytes.
Spelling correction and keyboard prediction — Fast, local, no API call needed.
Baseline evaluation — N-gram perplexity is a standard benchmark comparison.

The historical arc connects directly to the pretraining objectives covered in Topic 1: What Is a Language Model? and the architectural shifts in Topic 7: Transformers vs RNNs.

→ Classical models are lookup-based with smoothing; LLMs are representation learners. Classical systems remain interpretable and cheap, but cannot match the contextual flexibility, reasoning, and transfer capacity of transformer-based LLMs.

Python — N-gram vs Neural Language Model

from collections import Counter, defaultdict

# === Simple bigram language model (classical approach) ===
corpus = "the cat sat on the mat the cat ate the food"
words = corpus.split()

# Count bigram frequencies
bigrams = defaultdict(Counter)
for i in range(len(words) - 1):
    bigrams[words[i]][words[i + 1]] += 1

# Predict next word given "the"
context = "the"
total = sum(bigrams[context].values())
print(f"Bigram predictions after '{context}':")
for word, count in bigrams[context].most_common():
    print(f"  {word:8} p={count / total:.3f}")

# Limitation: bigram only sees 1 previous word
# An LLM sees the entire context window (thousands of tokens)
# and encodes meaning, not just surface-level co-occurrence

Follow-up Questions

What is smoothing and why do n-gram models need it?

Smoothing (e.g., Laplace, Kneser-Ney) assigns small non-zero probabilities to unseen n-grams so the model does not assign zero probability to valid sequences. This is necessary because no finite corpus can contain all possible word combinations. LLMs do not need explicit smoothing: the softmax output layer gives every token in the vocabulary a non-zero probability, and distributed representations let the model generalize to unseen combinations.
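
A minimal sketch of Laplace (add-one) smoothing on a tiny bigram model, using a made-up corpus. The helper names `prob_mle` and `prob_laplace` are hypothetical. Add-one is the simplest scheme and is shown only for clarity; practical n-gram systems usually prefer methods such as Kneser-Ney.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the food".split()
vocab = sorted(set(corpus))

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def prob_mle(prev: str, nxt: str) -> float:
    """Unsmoothed estimate: count(prev, nxt) / count(prev, *)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

def prob_laplace(prev: str, nxt: str) -> float:
    """Add-one estimate: (count + 1) / (total + vocabulary size)."""
    total = sum(bigram_counts[prev].values())
    return (bigram_counts[prev][nxt] + 1) / (total + len(vocab))

# "cat food" never appears in the corpus.
print(f"unsmoothed: {prob_mle('cat', 'food'):.3f}")
print(f"add-one:    {prob_laplace('cat', 'food'):.3f}")
```

The unsmoothed estimate is exactly zero for the unseen pair, which would zero out the probability of any sentence containing it; the smoothed estimate stays small but positive while the distribution over next words still sums to one.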

Could you combine n-gram features with neural models?

Yes. Hybrid approaches have been explored where n-gram statistics serve as features for neural models, or where n-gram models handle high-frequency patterns while neural models handle the long tail. In practice, modern LLMs have made this unnecessary for most applications because they learn local, n-gram-like patterns implicitly.

Is perplexity still a useful metric for evaluating LLMs?

Perplexity measures how well a model predicts a held-out text corpus. It is the exponential of the average negative log-likelihood per token, so lower perplexity means the model assigned higher probability to the text. It is still useful for comparing language models on the same data with the same tokenization, but it does not capture important qualities like instruction following, safety, or factuality. Modern LLM evaluation combines perplexity with task-specific benchmarks and human evaluation.
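
As a worked example, perplexity can be computed from the probabilities a model assigned to each actual token of a held-out text. The per-token probabilities below are made up for illustration, and `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each actual token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Made-up per-token probabilities for the same 4-token held-out text.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

ppl_confident = perplexity(confident_model)
ppl_uncertain = perplexity(uncertain_model)
print(f"confident: {ppl_confident:.2f}, uncertain: {ppl_uncertain:.2f}")
```

A perplexity of about 2 means the model was, on average, as uncertain as a choice between two equally likely tokens; about 10 corresponds to ten equally likely options.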