What language models are, how pretraining objectives differ, and the vocabulary strategies that let them handle any text.
What Is a Language Model?
The Core Idea
At its most basic, a language model learns a probability distribution over sequences of tokens. Given a partial sequence, it can estimate which token is most likely to appear next, or which token best fills a gap in context. Every modern LLM is, at its core, this probability estimator — scaled up with deep architectures and enormous training sets.
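As a minimal sketch of what "a probability distribution over sequences" means, the toy example below scores a sequence by multiplying conditional probabilities (the chain rule). The table `COND_PROBS` and its numbers are invented for illustration, and it conditions only on the previous token; a real language model conditions on the whole preceding sequence.

```python
import math

# Hypothetical conditional probabilities P(next | previous token).
# The vocabulary and numbers are made up for illustration only.
COND_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "a": {"cat": 0.4, "dog": 0.6},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
}


def sequence_log_prob(tokens):
    """Sum log P(token_t | token_{t-1}) over the sequence (chain rule)."""
    log_prob = 0.0
    prev = "<s>"  # start-of-sequence marker
    for tok in tokens:
        log_prob += math.log(COND_PROBS[prev][tok])
        prev = tok
    return log_prob


def most_likely_next(prev):
    """Return the highest-probability next token after `prev`."""
    options = COND_PROBS[prev]
    return max(options, key=options.get)


seq = ["the", "cat", "sat"]
print(f"P({' '.join(seq)}) = {math.exp(sequence_log_prob(seq)):.3f}")
print("Most likely token after 'the':", most_likely_next("the"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.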
Why "Large" Matters
"Large" does not only mean more parameters. It also implies:
- Longer training runs over more diverse data, enabling broader generalization.
- More sophisticated infrastructure — distributed training across hundreds or thousands of GPUs.
- Bigger context windows that let the model consider more information at once.
- New risks and costs, such as hallucination, sensitivity to distribution shift, and expensive deployment.
Scale brings capability but also cost and complexity. A strong interview answer connects both sides: why size enables broad statistical learning and why it introduces engineering challenges that did not exist with smaller models.
Model Family Landscape
The major model families differ in their pretraining objective, not just their brand name. Understanding this table is the fastest way to compare them:
| Family | Objective | Best For |
|---|---|---|
| Autoregressive LM | Predict next token | Generation, dialogue, coding |
| Masked LM | Recover hidden tokens | Representation, classification, retrieval |
| Seq2Seq | Map input to output sequence | Translation, summarization |
| Foundation Model | Broad pretraining + adaptation | Multi-task reuse across products |
Python — Exploring Token Probabilities
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load a small autoregressive language model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Encode a prompt and get next-token probabilities
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# logits shape: (batch, seq_len, vocab_size)
logits = outputs.logits[:, -1, :]
probs = torch.nn.functional.softmax(logits, dim=-1)
# Show the top 5 most likely next tokens
top5 = torch.topk(probs, 5)
for i in range(5):
    token_id = top5.indices[0][i].item()
    prob = top5.values[0][i].item()
    token_str = tokenizer.decode([token_id])
    print(f" {token_str!r:12} p={prob:.4f}")
What is the difference between a model and a model family?
Does "large" have a formal threshold?
Can a language model do things beyond language?
Autoregressive vs Masked Models
Objective Shapes Behavior
The pretraining objective is the single most important design decision for a language model, because it determines the default behavior the model develops. Autoregressive objectives train the model to continue sequences, which makes them natural at generation, dialogue, and open-ended tasks. Masked objectives train the model to build rich internal representations of context, which makes them strong at classification, retrieval, and understanding tasks.
Detailed Comparison
| Dimension | Autoregressive | Masked |
|---|---|---|
| Direction | Left-to-right only | Bidirectional |
| Primary strength | Generation, continuation | Representation, understanding |
| Inference mode | Token-by-token generation | Encode full input at once |
| Adaptation | Prompting, RLHF, fine-tuning | Fine-tuning with classification head |
| Examples | GPT, LLaMA, Claude | BERT, RoBERTa, DeBERTa |
A common interview mistake is to rank these as better or worse in the abstract. The right approach is to connect them to task fit: if you need generation, autoregressive is the natural choice. If you need embedding-based retrieval or classification, a masked model may give better representations per compute dollar. See Topic 9: Generative vs Discriminative for the broader framing.
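One way to picture the "Direction" row of the table is as an attention mask. The sketch below builds both kinds of mask for a short sequence, where 1 means "this position may look at that position". The function names are illustrative and do not come from any library.

```python
def causal_mask(n):
    """Row i may attend to columns 0..i (itself and earlier positions)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "cat", "sat", "down"]
masks = [
    ("Autoregressive (causal)", causal_mask(len(tokens))),
    ("Masked LM (bidirectional)", bidirectional_mask(len(tokens))),
]
for name, mask in masks:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>5} {row}")
```

The lower-triangular shape of the causal mask is what lets an autoregressive model be trained on every position at once while never seeing the token it has to predict.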
Python — Autoregressive vs Masked Prediction
from transformers import pipeline
# --- Autoregressive: complete the sequence ---
gen = pipeline("text-generation", model="gpt2", max_new_tokens=10)
result = gen("The capital of France is")
print("GPT-2 continuation:", result[0]["generated_text"])
# --- Masked: fill in the blank ---
fill = pipeline("fill-mask", model="bert-base-uncased")
result = fill("The capital of France is [MASK].")
print("BERT fill-mask:")
for r in result[:3]:
    print(f" {r['token_str']:12} score={r['score']:.4f}")
Can autoregressive models also do classification?
Why did autoregressive models win for general-purpose AI assistants?
Is there a model that combines both objectives?
Masked Language Modeling
How MLM Works
During pretraining, roughly 15% of tokens are selected for masking. Of those, 80% are replaced with a special [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. This mix prevents the model from only learning to recognize the [MASK] symbol and forces it to build robust representations for all positions.
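The 15% selection and the 80/10/10 split can be sketched in a few lines. This is a simplified illustration and not BERT's actual preprocessing code: it selects each token independently with probability `mask_prob`, and the function name and arguments are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Not selected: keep the token and require no prediction
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: leave unchanged
    return corrupted, labels


tokens = "the cat sat on the mat and purred loudly".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), mask_prob=0.3)
print("Input:    ", tokens)
print("Corrupted:", corrupted)
print("Labels:   ", labels)
```

The loss is computed only at positions where the label is not `None`, including the positions that were selected but left unchanged.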
Why Bidirectionality Matters
Because the masked token can depend on words appearing both before and after it, the model must learn to integrate context from all directions. This produces richer contextual embeddings than a left-to-right autoregressive model, which only ever looks backward. These embeddings are why BERT-style models proved so effective for:
- Search ranking — understanding whether a document matches a query
- Classification — sentiment analysis, toxicity detection, topic assignment
- Sentence-pair tasks — natural language inference, paraphrase detection
- Token-level tasks — named entity recognition, part-of-speech tagging
Compare this with the autoregressive objective described in Topic 2: Autoregressive vs Masked Models, which focuses on generation rather than representation.
Limitations
MLM does not train the model to generate text. A BERT-style model cannot produce a coherent paragraph the way GPT can, because it was never trained to predict sequences token by token. It excels at encoding input for downstream tasks, not at producing new output.
Python — Masked Language Modeling with BERT
from transformers import pipeline
# Create a fill-mask pipeline using BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# The model predicts what word best fills the [MASK] position
# using BOTH left and right context (bidirectional)
sentence = "The [MASK] sat on the mat and purred loudly."
predictions = fill_mask(sentence)
print(f"Input: {sentence}")
print("Top predictions:")
for p in predictions[:5]:
    # Each prediction uses context from "The" AND "sat on the mat..."
    print(f" {p['token_str']:12} confidence={p['score']:.4f}")
Why mask only 15% of tokens instead of more?
What improvements did RoBERTa make over BERT's MLM?
Can you use MLM for data augmentation?
Next Sentence Prediction
The Original Design
In the original BERT paper (Devlin et al., 2019), NSP was paired with Topic 3: Masked Language Modeling as a joint pretraining objective. The model received two segments: 50% of the time, sentence B actually followed sentence A in the corpus (IsNext), and 50% of the time, sentence B was a random sentence (NotNext). The model had to classify which case it was seeing.
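The 50/50 pair construction can be sketched as follows. This is a simplified illustration: it draws the NotNext sentence from the same list of sentences, whereas the description above only says it is a random sentence from the corpus. The function name `make_nsp_pairs` is an assumption made for the example.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) examples from ordered sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        sent_a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B really follows A
            pairs.append((sent_a, sentences[i + 1], "IsNext"))
        else:
            # Negative example: any sentence except A itself and A's true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sent_a, rng.choice(candidates), "NotNext"))
    return pairs


document = [
    "The cat sat on the warm windowsill.",
    "It stretched lazily in the afternoon sun.",
    "A bird landed on the fence outside.",
    "The cat's tail began to twitch.",
]
for sent_a, sent_b, label in make_nsp_pairs(document):
    print(f"{label:8} | {sent_a} || {sent_b}")
```

The sketch needs at least three sentences so that a negative candidate always exists.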
Why It Mattered
NSP was designed to teach cross-sentence reasoning, which is essential for tasks like:
- Natural language inference (NLI) — Does premise entail hypothesis?
- Question answering — Does this passage contain the answer?
- Paraphrase detection — Do these two sentences mean the same thing?
Why It Fell Out of Favor
Later models like RoBERTa, ALBERT, and SpanBERT found that removing NSP or replacing it with alternative objectives (like sentence order prediction) produced equal or better results. The consensus is that MLM alone, when combined with enough data and training time, captures sufficient discourse-level signal. NSP remains historically important as an illustration of how pretraining objectives shape downstream behavior.
Python — NSP with BERT
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
# Load BERT with its NSP head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
# Real pair: sentence B follows sentence A
sent_a = "The cat sat on the warm windowsill."
sent_b = "It stretched lazily in the afternoon sun."
# Tokenize as a sentence pair ([CLS] A [SEP] B [SEP])
inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# logits has shape (1, 2): index 0 = IsNext score, index 1 = NotNext score
prediction = torch.argmax(outputs.logits, dim=1).item()
print(f"Prediction: {'IsNext' if prediction == 0 else 'NotNext'}")
What replaced NSP in later models?
Is NSP still relevant for modern LLMs?
Could NSP help with retrieval or reranking tasks?
Subword OOV Handling
The Old Problem
Earlier NLP systems used fixed word-level vocabularies. Any word not in the vocabulary was replaced with a special [UNK] token, destroying information. This was especially problematic for proper nouns, technical terms, typos, and morphologically rich languages.
The Subword Solution
Subword tokenization algorithms like BPE (Byte Pair Encoding), WordPiece, and SentencePiece build vocabularies of frequent character sequences. Common words become single tokens; rare words decompose into recognizable pieces. This means:
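- No hard OOV — With byte-level variants, every string can be decomposed, down to individual bytes if necessary. Tokenizers without a byte or character fallback can still emit [UNK] for symbols they have never seen.
- Morphological sharing — "running," "runner," and "runs" often share a piece such as "run," helping the model generalize. The splits are learned from frequency, so they do not always line up with true morphemes.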
- Cross-lingual reuse — Shared subwords across languages enable multilingual models.
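To make the decomposition idea concrete, here is a toy version of BPE merge learning. It is a sketch under simplifying assumptions: it starts from characters instead of bytes, ignores word-boundary markers, and trains on a five-word corpus, so it illustrates the algorithm and is not a production tokenizer.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into characters, then apply the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["run", "runs", "running", "runner", "run"]
merges = learn_bpe_merges(corpus, num_merges=2)
print("Learned merges:", merges)
for word in ["runs", "rung", "xyz"]:
    print(f"{word:6} -> {encode(word, merges)}")
```

A word that never appeared in the corpus is still encoded, because anything the merges do not cover falls back to single characters.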
The Practical Lesson
OOV handling moved from dictionary design to tokenization design. The model may not know the semantics of a brand-new term, but it can ingest and process the text because the tokenizer decomposes it into known fragments. The model then relies on context and subword patterns to infer meaning.
| Strategy | Era | OOV Handling |
|---|---|---|
| Word-level vocab | Pre-2016 | Replace unknown words with [UNK] |
| Character-level | ~2015 | No OOV, but very long sequences |
| Subword (BPE/WordPiece) | 2016+ | Decompose into known subword units |
| Byte-level BPE | 2019+ | Decompose to raw bytes as fallback |
Python — Subword Decomposition
import tiktoken
# Load GPT-4's tokenizer
enc = tiktoken.get_encoding("cl100k_base")
# Show how unfamiliar words are decomposed into subwords
test_words = [
    "electroencephalography",
    "transformerization",
    "ChatGPT",
    "COVID-19",
    "running",
]
for word in test_words:
    tokens = enc.encode(word)
    # Decode each token individually to see the subword pieces
    pieces = [enc.decode([t]) for t in tokens]
    print(f" {word:28} -> {pieces} ({len(tokens)} tokens)")
Does the model actually understand a word it has never seen?
How does subword tokenization affect non-English languages?
What is byte-level BPE and how does it relate to OOV handling?
From sequence-to-sequence framing to the transformer revolution, and how to compare model paradigms in interviews.
Sequence-to-Sequence Models
Core Architecture Pattern
Seq2Seq models have two main components: an encoder that reads the input sequence and produces a representation, and a decoder that generates the output sequence token by token, conditioned on the encoder's representation. The encoder and decoder can be any sequence model — RNNs, LSTMs, or transformers.
Classic Use Cases
Seq2Seq shines when input and output have clearly distinct roles:
- Machine translation — English to French, source to target
- Summarization — Long document to short abstract
- Text normalization — Messy input to clean output
- Code generation — Natural language description to code
T5 and the Text-to-Text Framing
The T5 model (Raffel et al., 2020) pushed Seq2Seq to its logical extreme by framing every NLP task as a text-to-text problem. Classification becomes "classify: [input]" → "positive." Summarization becomes "summarize: [input]" → "[summary]." This unification simplified multi-task training and showed that the Seq2Seq framing is more general than it first appears. Compare this with the Topic 8: Foundation vs Task-Specific discussion.
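The text-to-text framing amounts to a formatting convention: every task becomes an input string and a target string. The sketch below shows that convention with a hypothetical helper. The task prefixes follow the examples in the paragraph above and are illustrative, not necessarily the exact prefixes a given T5 checkpoint was trained with.

```python
def to_text_to_text(task, text, target):
    """Format one supervised example as an (input_text, target_text) pair."""
    return f"{task}: {text}", target


examples = [
    to_text_to_text("classify", "This product exceeded my expectations!", "positive"),
    to_text_to_text("summarize", "Transformers replaced recurrent models ...", "Transformers replaced RNNs."),
]
for input_text, target_text in examples:
    print(f"{input_text!r} -> {target_text!r}")
```

Because every example has the same shape, one encoder-decoder model and one loss function can be trained on a mixture of tasks.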
Python — T5 Seq2Seq Summarization
from transformers import pipeline
# T5 treats every task as text-to-text (seq2seq)
summarizer = pipeline("summarization", model="t5-small")
article = """
Transformers replaced recurrent models by using self-attention
to process all tokens in parallel. This made training faster
and enabled models to capture long-range dependencies more
effectively. The architecture has become the standard for
both natural language processing and computer vision tasks.
"""
# The encoder reads the full article,
# the decoder generates the summary token by token
result = summarizer(article, max_length=50, min_length=10)
print("Summary:", result[0]["summary_text"])
How does a Seq2Seq encoder-decoder differ from a decoder-only model?
Are modern LLM chat systems actually Seq2Seq?
When would you choose an encoder-decoder over a decoder-only model today?
Transformers vs RNNs
Why RNNs Struggled at Scale
Recurrent neural networks process tokens sequentially: the hidden state at position t depends on the hidden state at position t-1. This creates two problems:
- Training bottleneck — Sequential processing prevents parallelization, making training on large datasets extremely slow.
- Vanishing gradients — Information from early tokens fades as it propagates through many time steps, making it hard to capture long-range dependencies even with LSTM/GRU gates.
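The vanishing-gradient effect can be seen in a toy scalar RNN. The sketch below assumes the simplest possible recurrence, h_t = tanh(w * h_{t-1} + x_t), with a single weight `w`. Real RNNs use weight matrices, but the same repeated multiplication is at work.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


for steps in (1, 10, 50, 100):
    g = gradient_through_time(w=0.9, inputs=[0.1] * steps)
    print(f"{steps:>4} steps: d h_T / d h_0 = {g:.3e}")
```

Each step multiplies the gradient by a factor whose magnitude is at most |w|, so with |w| < 1 the influence of the first token shrinks exponentially with distance.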
How Transformers Fixed It
The transformer architecture (Vaswani et al., 2017) replaced recurrence with self-attention, allowing every token to attend to every other token in a single layer. This provides:
- Full parallelism — All positions are computed simultaneously during training.
- Direct signal paths — Token 1 can directly attend to token 1000 without passing through 999 intermediate states.
- Scalability — Parallel training enabled models to grow from millions to hundreds of billions of parameters.
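A minimal pure-Python sketch of scaled dot-product self-attention shows why the signal path is direct: each output is a weighted sum over all positions, computed without stepping through intermediate states. For brevity it assumes the queries, keys, and values are the raw input vectors; a real transformer layer first applies learned projection matrices and uses multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:  # each query is independent, so these could run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(x, x, x)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

The weight matrix has one row per token and one column per token, which is also where the quadratic cost in sequence length comes from.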
The Cost of Attention
Self-attention has quadratic complexity in sequence length (O(n²)), which becomes expensive for very long contexts. This has driven research into efficient attention variants and state-space models, but standard transformers remain dominant for most production LLMs. See Topic 10: LLMs vs Classical Models for the broader historical arc.
Python — Attention vs Recurrence Complexity
def compare_complexity(seq_lengths):
    """Compare theoretical per-layer operation counts for an RNN vs self-attention."""
    print(f"{'Seq Len':>10} {'RNN (O(n))':>14} {'Attn (O(n^2))':>16} {'Ratio':>8}")
    print("-" * 52)
    for n in seq_lengths:
        rnn_ops = n       # Recurrence: O(n) steps, each waiting on the previous one
        attn_ops = n * n  # Self-attention: O(n^2) pairwise comparisons
        # The RNN's n steps must run one after another, while the n^2 attention
        # comparisons are independent and can run in parallel during training.
        print(f"{n:>10,} {rnn_ops:>14,} {attn_ops:>16,} {attn_ops/rnn_ops:>6.1f}x")
compare_complexity([128, 512, 2048, 8192, 32768])
Are RNNs completely dead in modern NLP?
What are state-space models and could they replace transformers?
How does positional encoding substitute for the implicit ordering RNNs provide?
Foundation vs Task-Specific Models
The Foundation Model Paradigm
Foundation models (Bommasani et al., 2021) represent a shift in how ML teams build products. Instead of training a separate model for each task, teams start from a broadly pretrained base and adapt it. This provides:
- Amortized training cost — One expensive pretraining run supports many downstream applications.
- Fast iteration — Prompting or light fine-tuning is cheaper and faster than training from scratch.
- Emergent capabilities — Foundation models exhibit behaviors not explicitly trained for, such as few-shot learning and chain-of-thought reasoning.
When Task-Specific Still Wins
Foundation models are not always the right choice:
| Factor | Foundation Model | Task-Specific Model |
|---|---|---|
| Label set stability | Handles evolving labels well | Better for fixed taxonomies |
| Inference cost | Higher (large model, API fees) | Lower (small, self-hosted) |
| Latency | Higher | Lower |
| Control & safety | Harder to control precisely | Easier to audit and constrain |
| Data requirements | Minimal (few-shot or zero-shot) | Needs labeled training data |
The best interview answer connects both: one base model can support many products, but that breadth also creates challenges in control, safety, and cost. See Topic 9: Generative vs Discriminative for the related modeling trade-off.
Python — Foundation Model Adaptation Strategies
# Three ways to adapt a foundation model for a specific task
# 1. Zero-shot prompting (no training data needed)
prompt_zeroshot = """Classify the following review as positive or negative.
Review: "This product exceeded all my expectations!"
Classification:"""
# 2. Few-shot prompting (a few examples in-context)
prompt_fewshot = """Classify reviews as positive or negative.
Review: "Absolutely love it!" -> positive
Review: "Terrible quality." -> negative
Review: "This product exceeded all my expectations!"
Classification:"""
# 3. LoRA fine-tuning (lightweight parameter adaptation)
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=16,                                 # Low-rank dimension
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
)
# peft_model = get_peft_model(base_model, config)
# Trains only ~0.1% of parameters, keeps base model frozen
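The "~0.1% of parameters" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes an illustrative decoder with hidden size 4096, 32 layers, and roughly 7 billion base parameters. These dimensions are assumptions chosen for the example and are not taken from the snippet above.

```python
def lora_trainable_params(d_in, d_out, rank, num_matrices):
    """LoRA adds two low-rank factors per adapted matrix: (d_in x rank) and (rank x d_out)."""
    return num_matrices * rank * (d_in + d_out)


# Assumed, illustrative dimensions for a ~7B-parameter decoder
hidden_size = 4096
num_layers = 32
base_params = 7_000_000_000
adapted_matrices = 2 * num_layers  # q_proj and v_proj in every layer

trainable = lora_trainable_params(hidden_size, hidden_size, rank=16, num_matrices=adapted_matrices)
print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of base model:       {trainable / base_params:.3%}")
```

Raising the rank or adapting more matrices increases the share linearly, which is the main cost/quality knob in a LoRA configuration.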
What are the main risks of depending on a third-party foundation model?
Is fine-tuning a foundation model the same as building a task-specific model?
How do you decide between prompting and fine-tuning?
Generative vs Discriminative Models
The Core Distinction
Generative models learn P(X) or P(X|context) — the probability distribution over data. This lets them sample new data points (generate text, images, etc.). Discriminative models learn P(Y|X) — the probability of a label given an input. They focus on decision boundaries rather than data generation.
Practical Comparison
| Aspect | Generative | Discriminative |
|---|---|---|
| Flexibility | Can handle many tasks via prompting | Specialized to one task |
| Efficiency | Higher cost per prediction | Lower cost, faster inference |
| Calibration | Harder to calibrate confidence | Easier to calibrate |
| Data needs | Can work zero/few-shot | Needs labeled training data |
| Output control | May produce unexpected formats | Structured by design |
The Blurring Line
Modern generative LLMs can perform discriminative tasks (classification, scoring, decision-making) through prompting. This has blurred the traditional boundary. A prompted GPT-4 can classify sentiment, but a fine-tuned BERT classifier will do it faster, cheaper, and with better-calibrated confidence scores for the same task. The right choice depends on whether the product needs open-ended generation or tightly controlled prediction. See Topic 8: Foundation vs Task-Specific for the related deployment trade-off.
Python — Generative vs Discriminative Classification
from transformers import pipeline
# --- Discriminative approach: dedicated sentiment classifier ---
# Fast, cheap, well-calibrated confidence scores
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was absolutely fantastic!")
print("Discriminative:", result)
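# Example output (exact score may vary): [{'label': 'POSITIVE', 'score': 0.9998}]
# --- Generative approach: prompt an LLM to classify ---
# Flexible, can explain reasoning, but slower and pricier.
# GPT-2 is a small base model used here only to show the mechanics;
# it may not return a clean label the way an instruction-tuned LLM would.
generator = pipeline("text-generation", model="gpt2", max_new_tokens=5)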
prompt = """Classify as positive or negative:
"This movie was absolutely fantastic!"
Sentiment:"""
result = generator(prompt)
print("Generative:", result[0]["generated_text"])
When would you use a generative model for classification instead of a discriminative one?
Can you use both in a single pipeline?
What does "calibration" mean in this context?
LLMs vs Classical Statistical Models
How Classical Models Work
An n-gram model estimates the probability of a word based on the previous n-1 words. A trigram model looks at two previous words: P("cat" | "the", "fat"). These probabilities come from counting occurrences in a corpus and applying smoothing techniques to handle unseen combinations.
The fundamental limitation is fixed context. A 5-gram model cannot consider anything beyond the last 4 words, no matter how relevant earlier context might be.
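The counting-plus-smoothing recipe can be sketched with add-alpha (Laplace) smoothing, one of the simplest smoothing techniques. The corpus and function names below are invented for illustration. Production n-gram models typically use more refined schemes such as Kneser-Ney.

```python
from collections import Counter


def train_ngram(tokens, n):
    """Count n-grams and the (n-1)-token contexts that precede them."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return ngrams, contexts


def smoothed_prob(word, context, ngrams, contexts, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(word | context); unseen combinations get a small non-zero value."""
    context = tuple(context)
    return (ngrams[context + (word,)] + alpha) / (contexts[context] + alpha * vocab_size)


tokens = "the fat cat sat on the fat mat".split()
vocab = sorted(set(tokens))
ngrams, contexts = train_ngram(tokens, n=3)

for word in ["cat", "mat", "sat"]:
    p = smoothed_prob(word, ("the", "fat"), ngrams, contexts, len(vocab))
    print(f"P({word!r} | 'the', 'fat') = {p:.3f}")
```

Without smoothing, "sat" would get probability zero after "the fat" because that trigram never occurs in the corpus.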
What LLMs Changed
Neural language models, and later LLMs at scale, brought three paradigm shifts:
- Distributed representations — Instead of discrete counts, words become dense vectors that capture semantic similarity. "King" and "queen" are nearby in embedding space, unlike in a count-based model.
- Deep architectures — Multiple layers of computation allow the model to build hierarchical representations, capturing syntax, semantics, and even reasoning patterns.
- Transfer learning — A model pretrained on general text can be adapted to new tasks without starting from scratch, which was impossible with n-gram models.
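The "nearby in embedding space" idea is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors that are invented for illustration. Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; the values are made up for this example
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "mat": [0.1, 0.0, 0.9],
}

for other in ("queen", "mat"):
    sim = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS[other])
    print(f"similarity(king, {other}) = {sim:.3f}")
```

In a count-based model, "king" and "queen" are unrelated symbols, so evidence about one tells the model nothing about the other.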
When Classical Models Still Win
Classical models remain useful in specific scenarios:
- Interpretability — You can inspect exact n-gram counts and probabilities.
- Low-resource deployment — A small n-gram model can fit in kilobytes or megabytes; an LLM typically needs gigabytes.
- Spelling correction and keyboard prediction — Fast, local, no API call needed.
- Baseline evaluation — N-gram perplexity is a standard benchmark comparison.
The historical arc connects directly to the pretraining objectives covered in Topic 1: What Is a Language Model? and the architectural shifts in Topic 7: Transformers vs RNNs.
Python — N-gram vs Neural Language Model
from collections import Counter, defaultdict
# === Simple bigram language model (classical approach) ===
corpus = "the cat sat on the mat the cat ate the food"
words = corpus.split()
# Count bigram frequencies
bigrams = defaultdict(Counter)
for i in range(len(words) - 1):
    bigrams[words[i]][words[i + 1]] += 1
# Predict next word given "the"
context = "the"
total = sum(bigrams[context].values())
print(f"Bigram predictions after '{context}':")
for word, count in bigrams[context].most_common():
    print(f" {word:8} p={count / total:.3f}")
# Limitation: bigram only sees 1 previous word
# An LLM sees the entire context window (thousands of tokens)
# and encodes meaning, not just surface-level co-occurrence