Ch 2: Tokens, Tokenization & Context Windows

1

What Is a Token?

A token is the smallest piece of text an LLM processes — not a character, not a word, but a subword chunk that the model's tokenizer has learned to recognize. Everything the model reads, generates, and bills you for is measured in tokens.

💡 A token is to an LLM what a pixel is to an image — the smallest unit the system can see.

From Text to Numbers

Large language models cannot read text directly. Before a single word reaches the neural network, a tokenizer splits the input into a sequence of tokens — integer IDs that map into the model's learned vocabulary. Each token ID points to a row in an embedding matrix, producing a dense vector the model can actually compute with.
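The lookup described above can be sketched in a few lines. This is a toy illustration only: the four-entry vocabulary, the IDs, and the three-dimensional vectors are all made up, whereas a real model learns its embedding matrix and uses a far larger vocabulary.

```python
# Toy vocabulary: maps token strings to integer IDs (hypothetical values).
vocab = {"The": 0, " quick": 1, " fox": 2, ".": 3}

# Toy embedding matrix: one row per vocabulary entry, 3 dimensions each.
# Real models learn these values and use far more dimensions.
embedding_matrix = [
    [0.1, 0.3, -0.2],   # row 0 -> "The"
    [0.7, -0.1, 0.4],   # row 1 -> " quick"
    [-0.5, 0.2, 0.9],   # row 2 -> " fox"
    [0.0, 0.0, 0.1],    # row 3 -> "."
]

def embed(pieces):
    """Map token strings to IDs, then look up each ID's embedding row."""
    ids = [vocab[p] for p in pieces]
    vectors = [embedding_matrix[i] for i in ids]
    return ids, vectors

ids, vectors = embed(["The", " quick", " fox", "."])
print(ids)         # [0, 1, 2, 3]
print(vectors[2])  # [-0.5, 0.2, 0.9]
```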

The vocabulary is fixed at training time. GPT-4's tokenizer (cl100k_base) has roughly 100,000 entries; Claude's is similar in scale. Every string you send is decomposed into a sequence drawn from that finite set.

Token Economy

Tokens are the billing unit of every major LLM API. When a provider quotes "$3 per million input tokens," they mean the tokenized length — not characters, not words. Because different text types tokenize at different densities, the same number of characters can cost wildly different amounts:

English prose: ~1 token per 4 characters
Python code: ~1 token per 3 characters (more whitespace, symbols)
JSON with long keys: ~1 token per 3.5 characters
CJK text: ~1 token per 1.5 characters (each character often becomes its own token)
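To make the density differences concrete, here is a rough estimator built only from the ratios listed above. The $3 per million rate is the example figure quoted earlier, and the 1,200,000-character input is an arbitrary assumption; for real budgeting, count tokens with the model's actual tokenizer.

```python
# Approximate characters per token, taken from the list above.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "python": 3.0,
    "json": 3.5,
    "cjk": 1.5,
}

def estimate_input_cost(num_chars, text_type, usd_per_million_tokens=3.0):
    """Rough input cost from a character count; not a substitute for a real tokenizer."""
    tokens = num_chars / CHARS_PER_TOKEN[text_type]
    return tokens, tokens * usd_per_million_tokens / 1_000_000

# Same character count, four different text types.
for text_type in CHARS_PER_TOKEN:
    tokens, cost = estimate_input_cost(1_200_000, text_type)
    print(f"{text_type:>8}: {tokens:>9,.0f} tokens  ${cost:.2f}")
```

With these assumed ratios, the same 1.2 million characters range from about 300,000 tokens as English prose to about 800,000 tokens as CJK text.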

Why This Matters

Understanding tokens unlocks three practical skills:

Cost estimation: You can predict API spend before sending a request.
Prompt engineering: You know exactly how much room you have inside the context window (see Topic 6: Context Window).
Debugging odd behavior: Spelling errors, hallucinated words, and strange code completions often trace back to how the tokenizer split the input. See Topic 2: Tokens vs Words for the word/token mismatch that causes most surprises.

→ Tokens are the atomic currency of LLMs — if you don't understand tokenization, you can't reason about cost, speed, or capacity.

Python Example

import tiktoken

# Load the tokenizer used by GPT-4 / ChatGPT
enc = tiktoken.get_encoding("cl100k_base")

text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)

print(f"Text:   {text}")
print(f"Tokens: {tokens}")
print(f"Count:  {len(tokens)}")

# Decode each token back to its string piece
for tid in tokens:
    print(f"  {tid:>6} -> {enc.decode([tid])!r}")

Follow-up Questions

How big is a typical vocabulary, and why not just use one token per character?

Modern LLM vocabularies typically range from about 32,000 to 150,000 entries. A byte-level vocabulary would be tiny (256 entries, one per possible byte value), but sequences would be several times longer, making self-attention far more expensive because its cost grows quadratically with sequence length. Subword tokenization strikes a balance between vocabulary size and sequence length.

Do different models use different tokenizers?

Yes. GPT-4 uses cl100k_base, GPT-3 used r50k_base, and Claude uses its own proprietary tokenizer. The same sentence can produce different token counts across models, which means cost estimates are model-specific. Always use the correct tokenizer library for the model you are targeting.

What happens when the model encounters a word it has never seen?

Subword tokenizers like BPE never truly encounter an "unknown" word. Any novel string is decomposed into smaller known subword pieces, down to individual bytes if necessary. This is a major advantage over older word-level tokenizers that required a special UNK token for out-of-vocabulary words.

Does whitespace count as tokens?

Yes, but whitespace is usually merged into the token that follows it. In most modern tokenizers a leading space is part of the next token — so " hello" (with a space) is a single token, different from "hello" without a space. This is why indentation in code can significantly affect token counts.

2

Tokens vs Words

Words and tokens are not the same thing. A single word can become multiple tokens, and in some tokenizers a single token can cover more than one word. The average number of tokens per word, called fertility, determines your real cost.

💡 If words are whole LEGO bricks, tokens are the studs and plates — the model builds meaning from smaller, reusable pieces.

Why the Mismatch

Words are a human concept with fuzzy boundaries. Is "don't" one word or two? Is "state-of-the-art" one word or four? Tokenizers don't care about these debates. They split text according to statistical patterns learned during training via algorithms like Topic 3: Byte-Pair Encoding.

Common words like "the", "is", and "hello" usually map to a single token. But longer or rarer words get split into subword pieces. "Uncharacteristically" might become four tokens: ["Un", "character", "istic", "ally"]. Meanwhile, frequent multi-character sequences like " the" (with a leading space) or "\n\n" are single tokens.

Fertility: The Hidden Cost Multiplier

Fertility is the average number of tokens per word. English prose typically has a fertility around 1.2–1.4, meaning 100 words become roughly 120–140 tokens. But this ratio shifts dramatically:

Input Type	Typical Fertility	Why
Simple English	1.1–1.3	Common words are single tokens
Technical English	1.4–1.8	Domain jargon splits into subwords
Python code	1.8–2.5	Symbols, indentation, identifiers
JSON	2.0–3.0	Brackets, colons, quoted keys
German	2.0–3.5	Long compound words
Korean / Thai	3.0–5.0+	Each syllable or character may be its own token

This has direct cost implications. Sending the same semantic content in Korean can cost 3–4x more in tokens than English. For multilingual applications, fertility analysis is essential for budgeting. See Topic 6: Context Window for how fertility impacts context budgets.

→ The token-to-word ratio (fertility) determines your real API cost, and in the table above it varies roughly fourfold across input types and languages.

Python Example

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":  "The quick brown fox jumps over the lazy dog.",
    "Code":     "def fibonacci(n):\n    return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "JSON":     '{"users": [{"name": "Alice", "age": 30}]}',
}

for label, text in samples.items():
    words = text.split()
    tokens = enc.encode(text)
    fertility = len(tokens) / len(words)
    print(f"{label:>8}: {len(words)} words -> {len(tokens)} tokens  (fertility {fertility:.2f})")

Follow-up Questions

Does this mean non-English languages are more expensive to use with LLMs?

Yes, in most current tokenizers. Languages with longer words (German, Finnish), non-Latin scripts (Arabic, Thai), or logographic systems (Chinese, Japanese) produce higher fertility ratios, meaning more tokens per word and therefore higher API costs for the same semantic content. Some newer tokenizers are being designed to reduce this disparity.

Can I write prompts that use fewer tokens?

Yes. Using common English words, avoiding unnecessary formatting (extra whitespace, verbose JSON keys), and writing concisely all reduce token count. Prompt compression techniques like abbreviating repeated instructions or using shorter variable names in code blocks can save 20–40% of tokens in some cases.

Should I chunk text by tokens instead of by words or characters?

Absolutely. When splitting text for retrieval-augmented generation (RAG) or for fitting within context limits, always chunk by token count. Chunking by characters or words can overshoot or undershoot the model's actual limits, leading to truncation errors or wasted capacity.
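A minimal sketch of token-based chunking, assuming the text has already been encoded into a list of token IDs (for example with enc.encode as in the example above). The function name and the overlap parameter are illustrative choices, not part of any library.

```python
def chunk_by_tokens(token_ids, chunk_size, overlap=0):
    """Split a list of token IDs into chunks of at most chunk_size tokens.

    Consecutive chunks share `overlap` tokens, so content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last chunk already reaches the end
    return chunks

# Stand-in for real token IDs (e.g. the output of enc.encode(text)).
token_ids = list(range(10))
print(chunk_by_tokens(token_ids, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Because every chunk is measured in tokens, none of them can exceed the limit you set, which is not guaranteed when splitting by characters or words.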

3

Byte-Pair Encoding (BPE)

BPE builds a tokenizer vocabulary from the bottom up: start with individual characters, then repeatedly merge the most frequent adjacent pair into a new token. After thousands of merges, common words are single tokens while rare words decompose into known subword pieces.

💡 BPE is like learning shorthand — common letter pairs get abbreviations first, then abbreviations combine into longer ones.

The UNK Problem

Early NLP systems used fixed word-level vocabularies. Any word not in the vocabulary became <UNK> (unknown). Even with a 50,000-word vocabulary, common misspellings, new slang, and German compound words would vanish into UNK. BPE largely solved this: because it decomposes any string into subword pieces, rare words are split rather than discarded. Byte-level BPE goes further and falls back to individual bytes, so no input is ever unknown.

The Algorithm Step by Step

1. Initialize: Start with a vocabulary of all individual characters (or bytes) found in the training corpus.
2. Count pairs: Scan all tokens in the corpus and count every adjacent pair (bigram).
3. Merge the top pair: Take the most frequent pair and merge it into a single new token. Add this token to the vocabulary.
4. Repeat: Go back to step 2. Continue until you reach the target vocabulary size (e.g., 50,000 merges).

The merge rules are saved in order. At inference time, the tokenizer applies the same merges in the same order to any new text, guaranteeing deterministic tokenization.
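Replaying merges on new text can be sketched as follows. The merge list here is hypothetical (a trained tokenizer would supply its own), and the end-of-word marker </w> is assumed to match the one used during training.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying learned merges in training order."""
    symbols = list(word) + ["</w>"]   # same end-of-word marker as in training
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]   # merge the pair in place
            else:
                i += 1
    return symbols

# Hypothetical merge list, in the order it was learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(apply_merges("lower", merges))   # ['low', 'er</w>']
print(apply_merges("slow", merges))    # ['s', 'low', '</w>']
```

Note that "slow" was never part of the hypothetical training data, yet it still tokenizes: the unseen prefix simply stays as a single character.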

Vocabulary Size Tradeoffs

Vocab Size	Pros	Cons
Small (~8K)	Compact model, fewer parameters in embedding layer	Longer sequences, more tokens per word
Medium (~32K)	Good balance for most languages	May still split technical terms
Large (~100K+)	Shorter sequences, common phrases as single tokens	Larger embedding matrix, sparser training signal per token
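The embedding-size side of this tradeoff is simple arithmetic: one row per vocabulary entry, one column per hidden dimension. The hidden size of 4096 below is an assumed value chosen for illustration, not a figure from any particular model.

```python
def embedding_params(vocab_size, d_model):
    """Parameters in the input embedding matrix: one d_model-sized row per token."""
    return vocab_size * d_model

# d_model = 4096 is an assumed hidden size, used only for illustration.
for vocab_size in (8_000, 32_000, 100_000):
    params = embedding_params(vocab_size, d_model=4096)
    print(f"{vocab_size:>7,} tokens -> {params / 1e6:,.1f}M embedding parameters")
```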

Which Models Use BPE?

BPE (and its variant byte-level BPE) is used by GPT-2, GPT-3, GPT-4, Claude, LLaMA, Mistral, and most modern LLMs. The main alternative toolkit is SentencePiece (see Topic 4: SentencePiece), which offers both a BPE mode and a unigram mode and takes a different approach to pre-tokenization.

→ BPE eliminated the unknown-word problem — every input decomposes into known subword pieces.

Python Example

# Simplified BPE training loop
from collections import Counter

def train_bpe(corpus, num_merges):
    # Split words into character lists
    words = [list(w) + ['</w>'] for w in corpus.split()]
    merges = []

    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = Counter()
        for word in words:
            for j in range(len(word) - 1):
                pairs[(word[j], word[j+1])] += 1

        if not pairs:
            break

        best = max(pairs, key=pairs.get)
        merges.append(best)
        print(f"Merge {i+1}: {best[0]} + {best[1]} -> {best[0]+best[1]}")

        # Apply merge to all words
        for word in words:
            j = 0
            while j < len(word) - 1:
                if (word[j], word[j+1]) == best:
                    word[j:j+2] = [best[0] + best[1]]
                else:
                    j += 1

    return merges

merges = train_bpe("low lower newest widest low low", 10)

Follow-up Questions

What is byte-level BPE, and how does it differ from character-level BPE?

Byte-level BPE starts with the 256 possible byte values instead of Unicode characters. This means the base vocabulary is fixed and tiny, and any byte sequence (including binary data) can be tokenized without UNK tokens. GPT-2 introduced this approach, and GPT-4 and Claude both use variants of it.

Does a larger vocabulary always produce better results?

Not necessarily. Larger vocabularies shrink sequence lengths (fewer tokens per sentence) but increase the embedding matrix size and require more training data per token to learn good representations. Most modern LLMs settle around 32K to 128K tokens as a practical sweet spot.

How does BPE differ from the unigram model used in SentencePiece?

BPE builds vocabulary bottom-up by merging pairs. The unigram model works top-down: it starts with a large candidate vocabulary and iteratively removes tokens that contribute least to the corpus likelihood. Unigram can assign probabilities to multiple tokenizations of the same string, while BPE is deterministic.

Can BPE merges cross word boundaries?

In standard BPE, merges happen within pre-tokenized words, so they cannot cross word boundaries. However, byte-level BPE with regex-based pre-tokenization (as in GPT-4's tiktoken) defines word boundaries via regex patterns, which can sometimes group spaces with following characters into the same pre-token before merges are applied.

4

SentencePiece

SentencePiece is a language-agnostic tokenizer that treats the input as a raw stream of Unicode characters: no whitespace splitting, no language-specific rules. One model file tokenizes any script, from English to Thai to Arabic, without modification.

💡 If BPE assumes words are pre-separated, SentencePiece looks at the raw page and figures out the words itself.

Classic Pipeline

1. Language detection
2. Whitespace word splitting
3. Language-specific rules
4. Apply BPE merges

SentencePiece Pipeline

1. NFKC normalization
2. Treat as a raw character stream
3. BPE or Unigram model
4. Output token IDs

Why Whitespace Assumptions Break

Traditional NLP pipelines assume words are separated by spaces. This works for English, French, and German, but fails catastrophically for:

Japanese: "私は学生です" has no spaces at all
Thai: "สวัสดีครับ" uses spaces only between sentences, not words
Chinese: Characters map to morphemes, not words — word boundaries are ambiguous

Building a separate tokenizer for each language is expensive, error-prone, and creates a maintenance nightmare. SentencePiece eliminates this by treating all text as a sequence of Unicode characters (or bytes), with whitespace represented as a special character (▁) rather than used as a delimiter. See Topic 3: Byte-Pair Encoding for the underlying merge algorithm.

How SentencePiece Works

Normalize: Apply NFKC Unicode normalization to canonicalize characters (e.g., full-width "Ａ" becomes ASCII "A").
Escape whitespace: Replace spaces with the meta-symbol ▁ (U+2581). The input becomes one continuous string.
Train or apply: Use either BPE (bottom-up merging) or the unigram language model (top-down pruning) to build or apply the vocabulary.
Output: A sequence of token IDs and a single .model file that works for any language.
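The first two steps above can be approximated with the standard library alone. This is a sketch of the idea, not SentencePiece's implementation: the real library does more (for example, it can prepend a meta-symbol to the start of the text), and the function names here are made up.

```python
import unicodedata

META = "\u2581"  # the "▁" meta-symbol

def escape(text):
    """Steps 1-2: NFKC-normalize, then replace spaces with the meta-symbol."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace(" ", META)

def unescape(pieces):
    """Reverse step 2: join pieces and turn meta-symbols back into spaces."""
    return "".join(pieces).replace(META, " ")

# The input starts with a full-width "H" (U+FF28), which NFKC maps to ASCII "H".
escaped = escape("\uff28ello world")
print(escaped)                                    # Hello▁world
print(unescape(["Hello", META + "wor", "ld"]))    # Hello world
```

Because the space survives as an ordinary symbol inside the pieces, decoding is a plain string join with no language-specific rules.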

BPE vs Unigram Mode

Feature	BPE Mode	Unigram Mode
Direction	Bottom-up (merge pairs)	Top-down (prune vocabulary)
Determinism	One tokenization per input	Multiple possible tokenizations, picks best by likelihood
Regularization	Not built-in	Subword regularization (sample different tokenizations during training)
Used by	LLaMA, Mistral	T5, mBART, ALBERT
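To illustrate how unigram mode "picks best by likelihood", here is a small Viterbi search over a toy vocabulary. The piece probabilities are invented for the example; a trained unigram model would supply them, and the real implementation differs in many details.

```python
import math

def best_segmentation(text, log_probs):
    """Viterbi search for the highest-likelihood split of text into known pieces."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    back = [0] * (n + 1)                # start index of the last piece in that split
    best_score[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start

    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]

# Hypothetical piece probabilities; a trained model would supply these.
probs = {"un": 0.10, "related": 0.05, "u": 0.02, "n": 0.02,
         "rel": 0.03, "ated": 0.03, "unrelated": 0.001}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = best_segmentation("unrelated", log_probs)
print(pieces)   # ['un', 'related']
```

Several splits are valid here ("unrelated" as one piece, or "un" + "rel" + "ated"), but "un" + "related" has the highest product of probabilities, so it wins.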

→ SentencePiece made tokenization truly language-agnostic — one algorithm, one model file, any script.

Python Example

import sentencepiece as spm

# Train a SentencePiece model on a text file
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=8000,
    model_type='bpe',          # or 'unigram'
    character_coverage=0.9995,  # cover 99.95% of characters
)

# Load and use
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')

text = "SentencePiece works for any language!"
pieces = sp.encode(text, out_type=str)
ids    = sp.encode(text, out_type=int)

print(f"Pieces: {pieces}")
print(f"IDs:    {ids}")
print(f"Decoded: {sp.decode(ids)}")

Follow-up Questions

How does SentencePiece compare to tiktoken (used by OpenAI)?

tiktoken uses byte-level BPE with regex-based pre-tokenization: it splits on regex patterns first, then applies BPE within those chunks. SentencePiece skips pre-tokenization entirely. tiktoken, whose core is implemented in Rust, is built for fast encoding with a fixed vocabulary, while SentencePiece (implemented in C++) is more flexible for training custom tokenizers on multilingual data.

What is NFKC normalization, and can it cause problems?

NFKC (Normalization Form Compatibility Composition) maps compatibility-equivalent characters to a canonical form: for example, full-width digits become ASCII digits, and ligatures split into separate characters. This occasionally causes problems when the exact original characters matter, such as in code where Unicode identifiers are intentional.

Can I add custom pre-tokenization rules to SentencePiece?

Yes. SentencePiece supports custom normalization rules and user-defined symbols via its training configuration. You can force specific strings to remain as single tokens (e.g., domain-specific abbreviations) or define custom splitting rules. However, heavy customization can reduce the language-agnostic benefits that make SentencePiece attractive in the first place.

5

Special Tokens

Special tokens are reserved symbols that never appear in normal text — they mark boundaries between messages, signal the start and end of generation, and tell the model which role is speaking. They are invisible to users but fundamental to how every chat interaction is structured.

💡 Special tokens are like punctuation in music notation — rests and bar lines don't make sound, but without them the piece falls apart.

Token Reference Table

Token	Name	Purpose	Used By
<|begin_of_text|>	BOS	Marks the very start of the input sequence	LLaMA 3 (LLaMA 2 and Mistral use <s>)
<|end_of_text|>	EOS	Signals the model to stop generating	LLaMA 3 (GPT models use <|endoftext|>)
<|im_start|>	Role start	Begins a new message with a role (system/user/assistant)	ChatML format
<|im_end|>	Role end	Ends the current message	ChatML format
[INST] / [/INST]	Instruction	Wraps user instructions	LLaMA 2, Mistral
[PAD] / <pad>	Padding	Fills unused positions in fixed-length batches	BERT / T5
[SEP]	Separator	Separates segments (e.g., document from query)	BERT

Why Fine-Tuning Goes Wrong

The most common fine-tuning mistake is getting special tokens wrong. If your training data uses a different chat template than the base model expects, the model cannot tell where one message ends and another begins. Symptoms include:

The model echoes the prompt back instead of responding
It generates text attributed to the wrong role
It refuses to stop generating (missing EOS)
It produces garbled output at message boundaries

Always verify that your fine-tuning data uses the exact same special token format as the base model's chat template.

Chat Templates

Each model family defines a chat template — the exact sequence of special tokens that wraps each message. The Hugging Face transformers library stores these as Jinja2 templates in the tokenizer config. When you call tokenizer.apply_chat_template(), it handles the formatting automatically.

Getting this right is essential. The same model will behave completely differently depending on whether special tokens are correctly placed. See Topic 1: What Is a Token? for how these special tokens are just integer IDs in the same vocabulary as regular tokens.
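To show what a chat template actually produces, here is a hand-written formatter for the ChatML layout. It only illustrates the string layout; the function name is made up, and in practice you should let tokenizer.apply_chat_template() do this so that the special tokens map to their reserved IDs.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render messages in the ChatML layout, one delimited block per message."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are special tokens?"},
]
print(format_chatml(messages))
```

Dropping or misplacing any one of these markers changes where the model believes a message starts or ends.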

→ Special tokens are the skeleton of model input — invisible to users but they define every structural boundary.

Python Example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What are special tokens?"},
]

# apply_chat_template inserts the model's special tokens automatically.
# add_generation_prompt=True appends the header that opens the assistant's
# turn, so the conversation should end with a user message.
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(formatted)

# Inspect the special tokens in the vocabulary
print("Special tokens:", tokenizer.all_special_tokens)
print("Special IDs:",    tokenizer.all_special_ids)

Follow-up Questions

How can I debug special token issues in my prompts?

Encode your prompt with tokenizer.encode() and decode each token individually to see the exact sequence. Look for missing or duplicated BOS/EOS tokens, or role markers that don't match the model's expected template. Most issues become obvious when you inspect the raw token ID sequence.

What is ChatML, and why does it matter?

ChatML (Chat Markup Language) is a format introduced by OpenAI that uses <|im_start|> and <|im_end|> tokens to delimit messages. Many open-source models (Qwen, Yi, OpenChat) adopted it, making it a de facto standard. If you fine-tune on ChatML data but deploy with a different template, the model will not understand role boundaries.

Do special tokens affect the attention mask?

Yes. Padding tokens are masked out (attention weight = 0) so the model ignores them. BOS and EOS tokens are typically attended to. During training, the loss is usually computed only on assistant tokens, not on special tokens or user input, which is controlled by constructing the appropriate label mask.
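A label mask of this kind can be sketched as below. The token IDs and role tags are hypothetical, and -100 is used because it is the default ignore index of PyTorch's cross-entropy loss; some training pipelines deliberately leave the end-of-turn token unmasked so the model learns when to stop.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(token_ids, roles):
    """Copy token IDs as labels, masking every position not written by the assistant."""
    return [
        tid if role == "assistant" else IGNORE_INDEX
        for tid, role in zip(token_ids, roles)
    ]

# Hypothetical token IDs with the role that produced each position.
token_ids = [1, 501, 502, 503, 601, 602, 2]
roles = ["special", "user", "user", "user", "assistant", "assistant", "special"]

print(build_labels(token_ids, roles))
# [-100, -100, -100, -100, 601, 602, -100]
```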

6

Context Window

The context window is the total number of tokens a model can see at once — including input, conversation history, and the output it generates. Everything outside the window simply doesn't exist to the model.

💡 The context window is RAM, not a hard drive — only what's loaded can be used, and it's wiped between sessions.

Attention Scope, Not Memory

Despite what the name suggests, the context window is not memory in any durable sense. It is the attention scope — the set of tokens the model's self-attention layers can attend to when generating the next token. Once a conversation ends or a token falls outside the window, it is gone. The model has no mechanism to "remember" it unless you re-inject the information.

This is fundamentally different from how humans process information. We forget details but retain gist indefinitely. An LLM can attend to everything inside the window and to nothing outside it.

The Math: 128K ≈ 96K Words

A common rule of thumb is 1 token ≈ 0.75 words in English (see Topic 1: What Is a Token?). So a 128K context window holds roughly 96,000 words — about the length of a novel. That sounds like a lot, but production prompts fill up fast:

Component	Typical Size
System prompt	500 – 4,000 tokens
Conversation history (10 turns)	2,000 – 20,000 tokens
RAG chunks (5 documents)	5,000 – 50,000 tokens
User query	50 – 2,000 tokens
Output reserve	4,096 – 16,384 tokens

Output Eats the Same Budget

A detail often missed by beginners: generated output tokens consume context window space. If you have 128K tokens and your input uses 120K, the model can only produce 8K tokens of output before hitting the limit. Many APIs enforce a separate max_tokens parameter, but the sum of input + output can never exceed the window.
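The arithmetic is worth writing down once. This helper is a sketch with made-up names: the output budget is whatever the window has left after the input, further capped by max_tokens when one is set.

```python
def max_output_tokens(window, input_tokens, max_tokens=None):
    """Tokens left for generation once the input has been counted."""
    remaining = max(window - input_tokens, 0)
    if max_tokens is not None:
        remaining = min(remaining, max_tokens)
    return remaining

print(max_output_tokens(128_000, 120_000))                    # 8000
print(max_output_tokens(128_000, 120_000, max_tokens=4_096))  # 4096
print(max_output_tokens(128_000, 130_000))                    # 0
```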

Lost in the Middle

Research shows that LLMs attend most strongly to information at the beginning and end of the context window. Information buried in the middle is more likely to be ignored — the "lost in the middle" effect. This means context window management isn't just about fitting tokens, but about positioning them strategically.
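One simple way to act on this is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. The function below is an illustrative sketch, not a standard API, and it assumes the chunks arrive already sorted from most to least relevant.

```python
def reorder_for_edges(chunks_by_relevance):
    """Place the most relevant chunks at the start and end, weakest in the middle.

    Expects chunks sorted from most to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)   # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(chunk)    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]

ranked = ["A (best)", "B", "C", "D", "E (worst)"]
print(reorder_for_edges(ranked))
# ['A (best)', 'C', 'E (worst)', 'D', 'B']
```

The best chunk opens the context, the second best closes it, and the least relevant material lands in the middle where it is most likely to be overlooked anyway.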

Longer ≠ Better

Larger context windows bring diminishing returns and real costs. Attention computation scales O(n²) with sequence length, so doubling the context quadruples attention cost. Adding irrelevant context can actually decrease accuracy by diluting signal. Smart retrieval (see Topic 10: Token Budgeting in Production) consistently outperforms brute-force context stuffing.
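The quadratic claim is easy to check numerically. This sketch counts only the pairwise attention work and ignores every other cost in the model, so treat the ratios as an upper-level intuition rather than a latency prediction.

```python
def relative_attention_cost(new_len, base_len):
    """How many times more pairwise attention work new_len needs than base_len."""
    return (new_len / base_len) ** 2

print(relative_attention_cost(16_000, 8_000))    # 4.0
print(relative_attention_cost(128_000, 8_000))   # 256.0
```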

→ Context is expensive real estate — every token must earn its place, and more context can hurt if it's noise.

Python Example

import tiktoken

def check_context_budget(system, history, documents, query,
                          model="gpt-4", max_output=4096):
    """Check whether a prompt fits the context window."""
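    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # tiktoken only knows OpenAI models; counts for others are approximate
        enc = tiktoken.get_encoding("cl100k_base")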
    window_sizes = {
        "gpt-4": 8192,
        "gpt-4-turbo": 128000,
        "gpt-4o": 128000,
        "claude-3-opus": 200000,
    }
    window = window_sizes.get(model, 8192)

    parts = {
        "system": len(enc.encode(system)),
        "history": sum(len(enc.encode(m)) for m in history),
        "documents": sum(len(enc.encode(d)) for d in documents),
        "query": len(enc.encode(query)),
        "output_reserve": max_output,
    }
    total = sum(parts.values())
    remaining = window - total

    return {
        "parts": parts,
        "total": total,
        "window": window,
        "remaining": remaining,
        "fits": remaining >= 0,
    }

# Usage
result = check_context_budget(
    system="You are a helpful assistant...",
    history=["Hello", "Hi! How can I help?"],
    documents=["Doc content here..."],
    query="Summarize the document",
)
print(f"Fits: {result['fits']}, Remaining: {result['remaining']}")

Follow-up Questions

What exactly is the "lost in the middle" effect and how bad is it?

Research by Liu et al. (2023) showed that LLMs retrieve facts near the beginning or end of the context far more accurately than facts buried in the middle. In some tests, accuracy dropped by 20-30% for middle-positioned information. The practical fix is to place the most critical context at the start or end of your prompt.

How does context length affect latency?

Self-attention scales O(n²) with sequence length, so doubling context roughly quadruples attention computation. In practice, time-to-first-token increases noticeably beyond ~32K tokens, and total generation time grows because every new output token must attend to all prior tokens. Some providers use techniques like FlashAttention to reduce the constant factor but cannot change the fundamental scaling.

What is the difference between effective and theoretical context length?

The theoretical context length is the maximum tokens the architecture supports (e.g., 128K). The effective context length is the range over which the model actually uses information reliably — often significantly shorter. Needle-in-a-haystack benchmarks show many models degrade well before hitting their theoretical limits.

How do different models' context windows compare?

As of 2025, GPT-4o supports 128K tokens, Claude 3.5 supports 200K, Gemini 1.5 Pro supports up to 1M, and Llama 3 variants range from 8K to 128K. However, bigger is not always better — effective utilization, pricing per token, and attention quality vary widely. A model with a well-used 32K window can outperform one with a poorly-attended 200K window.

7

Cost & Latency

Every token you send and receive has a price. Input tokens are metered, output tokens cost 3-5x more for the models compared in this topic, and latency grows with context length. Understanding the meter is the first step to controlling the bill.

💡 Token count is to LLM cost what kilowatt-hours are to your electric bill — the meter runs constantly, and some appliances are hungrier.

Where the Money Goes

LLM APIs charge per token, with separate rates for input (prompt) and output (completion). Here's a comparison across popular models (see Topic 2: Tokens vs Words for how token counts relate to text length):

Model	Input ($/1M tokens)	Output ($/1M tokens)	Ratio
GPT-3.5 Turbo	$0.50	$1.50	3x
GPT-4o	$2.50	$10.00	4x
GPT-4 Turbo	$10.00	$30.00	3x
Claude 3.5 Sonnet	$3.00	$15.00	5x
Claude 3 Opus	$15.00	$75.00	5x

Why Output Costs More

Input tokens are processed in parallel through the transformer: the entire prompt is evaluated in one forward pass. Output tokens, however, are generated autoregressively: one at a time, each requiring its own forward pass through the model. This sequential generation uses the hardware far less efficiently per token, which is why providers charge a premium.

Additionally, each output token must attend to all previous tokens (input + already-generated output), so the cost per token increases as the response gets longer.

Token-Efficient Prompting

The most impactful cost optimization is controlling output length. Techniques include:

Constrained output formats: Ask for JSON, CSV, or structured data instead of prose
Max token limits: Set max_tokens to a reasonable ceiling
Explicit length instructions: "Answer in 2-3 sentences" or "List the top 5 only"
System prompt optimization: Remove redundant instructions, use concise phrasing
Prompt caching: Many providers cache repeated prompt prefixes at reduced rates (see Topic 10: Token Budgeting in Production)

→ Output tokens cost 3-5x more than input in the pricing table above, so controlling response length and format is the highest-leverage cost optimization.

Python Example

import tiktoken

# Pricing per 1M tokens (input, output)
PRICING = {
    "gpt-3.5-turbo": (0.50, 1.50),
    "gpt-4o":         (2.50, 10.00),
    "gpt-4-turbo":   (10.00, 30.00),
}

def estimate_cost(prompt, expected_output_tokens, model="gpt-4o"):
    """Estimate API call cost in dollars."""
    enc = tiktoken.encoding_for_model(model)
    input_tokens = len(enc.encode(prompt))
    in_rate, out_rate = PRICING[model]

    input_cost  = input_tokens * in_rate / 1_000_000
    output_cost = expected_output_tokens * out_rate / 1_000_000

    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "input_cost": round(input_cost, 6),
        "output_cost": round(output_cost, 6),
        "total_cost": round(input_cost + output_cost, 6),
    }

# Compare verbose vs concise prompting
verbose = "Please provide a very detailed and comprehensive analysis..."
concise = "Analyze briefly in JSON: {sentiment, topics, action_items}"

print("Verbose:", estimate_cost(verbose, 2000))
print("Concise:", estimate_cost(concise, 200))

Follow-up Questions

How does prompt caching reduce costs?

Providers like Anthropic and OpenAI cache the KV (key-value) attention states of repeated prompt prefixes. If your system prompt and context stay the same across calls, cached input tokens are billed at 50-90% less than full price. The key requirement is that the cached portion must be an exact prefix match — any change invalidates the cache.
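A back-of-the-envelope comparison shows why a stable prefix matters. Every number here is an assumption chosen for illustration (prefix size, rate, and a 90% discount), and the sketch ignores any extra charge a provider may apply for writing to the cache.

```python
def input_cost_with_cache(prefix_tokens, fresh_tokens, usd_per_million,
                          cached_discount):
    """Input cost when prefix_tokens are served from cache at a discount."""
    rate = usd_per_million / 1_000_000
    cached_cost = prefix_tokens * rate * (1 - cached_discount)
    fresh_cost = fresh_tokens * rate
    return cached_cost + fresh_cost

# Assumed numbers: a 10,000-token shared prefix, 500 new tokens per call,
# $3 per million input tokens, and a 90% discount on cached tokens.
without_cache = input_cost_with_cache(10_000, 500, 3.0, cached_discount=0.0)
with_cache = input_cost_with_cache(10_000, 500, 3.0, cached_discount=0.9)
print(f"${without_cache:.4f} vs ${with_cache:.4f} per call")
```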

What are the best strategies for reducing output token usage?

The most effective strategies are: requesting structured output (JSON/CSV instead of prose), setting explicit max_tokens limits, asking for concise formats ("bullet points, not paragraphs"), and using function calling which constrains output to a schema. In production, structured output can reduce output tokens by 60-80% compared to free-form responses.

Does batching requests save money?

Several providers offer batch APIs at a discount of about 50% (e.g., OpenAI's Batch API). The trade-off is latency: batch requests complete within 24 hours rather than seconds. For non-real-time workloads like data processing, classification, or content generation, batching is one of the easiest cost reductions available.

Why is the output-to-input price ratio so high?

Input tokens are processed in parallel via a single forward pass, while output tokens require sequential autoregressive generation: each token needs its own forward pass. Generating 100 output tokens therefore takes roughly 100 sequential passes, whereas 100 input tokens are handled in one. With key-value caching the arithmetic per token is similar, but sequential decoding keeps the hardware far less busy, so each output token is more expensive to serve. The 3-5x price ratio largely reflects that lower efficiency.

8

Exceeding the Context

When your input exceeds the context window, the model cannot process it. You need a strategy — truncation, sliding windows, or retrieval-based selection — to fit the most relevant information into the available space.

💡 Overflow is like fitting a semester of notes onto one exam cheat sheet — you have to choose what makes the cut.

Why Bigger Windows Don't Solve It

It's tempting to think that ever-larger context windows will eliminate overflow problems. They won't, for three reasons:

Cost scales linearly with input size — filling a 200K window costs 25x more than filling 8K
Latency grows faster than linearly: attention computation scales as O(n²) with context length
Quality degrades — the "lost in the middle" effect (see Topic 6: Context Window) means irrelevant context actively hurts accuracy

Even with a 1M-token window, the question is never "can it fit?" but "should it fit?"

Three Strategies

Strategy	Mechanism	Pros	Cons
Truncation	Cut content beyond limit	Simplest to implement	Loses recent/tail content blindly
Sliding Window	Keep most recent N tokens	Preserves recency	Loses early context; risks dropping the system prompt
Smart Retrieval	Embed & retrieve relevant chunks	Best accuracy, query-aware	Requires vector store infrastructure

Retrieval-Based Selection

The most effective approach is retrieval-augmented generation (RAG): embed all documents into a vector store, then at query time retrieve only the most semantically relevant chunks. This approach (see Topic 9: Truncation vs Sliding Windows vs Summarization) ensures every token in the context is working toward answering the user's question.

Key implementation details:

Chunk documents into 200-500 token segments with overlap
Embed chunks using a model like text-embedding-3-small
At query time, retrieve top-k chunks by cosine similarity
Always reserve space for system prompt + query + output

→ Retrieval-based chunk selection almost always outperforms brute-force context filling.
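
The retrieval step in the list above can be sketched as follows. This is a minimal illustration in plain Python: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine` and `top_k` are hypothetical helper names, not part of any library. A real system would get vectors from an embedding model and would usually delegate the search to a vector store.

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: List[float],
          chunk_vecs: List[Tuple[str, List[float]]],
          k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunk_vecs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors standing in for embeddings (assumption for illustration only)
chunks = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("returns and refunds", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]
best = top_k(query, chunks, k=2)
for score, text in best:
    print(f"{score:.3f}  {text}")
```

The selected chunks would then be packed into whatever budget remains after the system prompt, query, and output reserve are set aside.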

Python Example

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int
    relevance: float  # 0-1 similarity score

def select_chunks(chunks, budget, strategy="retrieval"):
    """Select chunks that fit within token budget."""
    selected, used = [], 0

    if strategy == "truncate":
        # Take chunks in order; stop at the first one that does not fit
        for c in chunks:
            if used + c.tokens > budget:
                break
            selected.append(c)
            used += c.tokens

    elif strategy == "sliding":
        # Walk backwards from the most recent chunk; stop when the budget is spent
        for c in reversed(chunks):
            if used + c.tokens > budget:
                break
            selected.append(c)
            used += c.tokens
        selected.reverse()  # Restore chronological order

    elif strategy == "retrieval":
        # Sort by relevance, pack greedily (chunks that do not fit are skipped)
        ranked = sorted(chunks, key=lambda c: c.relevance, reverse=True)
        for c in ranked:
            if used + c.tokens <= budget:
                selected.append(c)
                used += c.tokens

    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return selected

Follow-up Questions

When is truncation better than summarization?

Truncation is better when you need exact quotes, precise data, or verbatim content — summarization loses fidelity. It's also appropriate for simple tasks where the answer is likely in the first portion of the document, or when latency matters too much to run a summarization step first.

How does hierarchical summarization work for long documents?

Hierarchical summarization splits a document into chunks, summarizes each chunk independently, then summarizes the summaries. This creates a tree of increasingly compressed representations. The key advantage is that it can handle documents of arbitrary length while preserving the overall structure. The trade-off is multiple LLM calls and cumulative information loss at each level.
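
A minimal sketch of that tree-shaped process is shown below. The `summarize` callable is a hypothetical stand-in for an LLM call, and `fan_in` (how many summaries are merged per call) is an assumed parameter, not something the text above prescribes.

```python
from typing import Callable, List

def hierarchical_summary(chunks: List[str],
                         summarize: Callable[[str], str],
                         fan_in: int = 4) -> str:
    """Summarize each chunk, then merge summaries level by level until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not chunks:
        return ""
    # Level 0: one summary per chunk (the leaves of the tree)
    level = [summarize(c) for c in chunks]
    # Higher levels: summarize groups of fan_in summaries
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [summarize("\n".join(group)) for group in groups]
    return level[0]

# Stand-in summarizer for demonstration: keeps the first 40 characters
def fake_summarize(text: str) -> str:
    return text[:40]

print(hierarchical_summary(["chapter one ...", "chapter two ...", "chapter three ..."],
                           fake_summarize, fan_in=2))
```

Each pass through the loop is another round of LLM calls, which is where both the extra cost and the cumulative information loss come from.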

What is the map-reduce approach to context overflow?

In map-reduce, you "map" the same question to each chunk independently (getting partial answers), then "reduce" by combining those partial answers into a final response. This works well for aggregation tasks (counting, listing, comparing) but poorly for tasks requiring cross-chunk reasoning. It's popular in frameworks like LangChain for question-answering over large document sets.
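
The pattern can be sketched in a few lines. Here `ask` is a hypothetical function that sends a prompt to an LLM and returns its text response, and the prompt wording is only an assumption for illustration.

```python
from typing import Callable, List

def map_reduce_answer(question: str, chunks: List[str],
                      ask: Callable[[str], str]) -> str:
    """Answer a question over many chunks with one call per chunk plus one merge call."""
    # Map: put the same question to each chunk independently
    partials = [ask(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks]
    # Reduce: combine the partial answers in a single final call
    combined = "\n".join(f"- {p}" for p in partials)
    return ask(f"Partial answers:\n{combined}\n\nCombine these to answer: {question}")
```

Because each map call sees only its own chunk, facts that must be joined across chunks only meet in the reduce step, which explains the weakness on cross-chunk reasoning.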

9

Truncation vs Sliding Windows vs Summarization

When content exceeds the context window, three core strategies compete: truncation (cut the tail), sliding windows (move the spotlight), and summarization (compress to essentials). Each trades off differently between simplicity, recency, and fidelity.

💡 Truncation is a guillotine. Sliding windows are a spotlight scanning a dark room. Summarization captures the gist but loses the quotes.

Truncation: keep first N, drop the rest
Sliding Window: focus moves through content
Summarization: compress to essential meaning

Strategy Comparison

Dimension	Truncation	Sliding Window	Summarization
Complexity	Trivial	Low	Medium-High
Preserves recency	No (keeps oldest)	Yes	Partially
Preserves exact text	Yes (what's kept)	Yes (what's visible)	No
Extra LLM calls	0	0	1+ per summary
Info loss pattern	Total loss of tail	Total loss of head	Distributed lossy compression
Best for	Simple queries, first-pass	Chat/dialogue	Long documents, multi-turn

When to Use Each

Truncation when the answer is likely near the beginning (e.g., abstracts, headers) and you need zero added latency
Sliding Window for conversational history where recent turns matter most (see Topic 8: Exceeding the Context) — most chat applications use this by default
Summarization when you need the full document's gist — legal review, research synthesis, multi-turn agents that must remember early decisions

The Production Hybrid

Real production systems rarely use a single strategy. The winning pattern combines all three (see Topic 10: Token Budgeting in Production):

System prompt — always kept verbatim (never truncated or summarized)
Old conversation history — progressively summarized into compressed blocks
Retrieved context — RAG chunks selected by relevance, truncated if individual chunks are too long
Recent turns — kept verbatim in a sliding window of the last 3-5 exchanges
User query + output reserve — always protected at full fidelity

This hybrid ensures that the model has structure (system prompt), gist (summaries), evidence (RAG), and recency (sliding window) — all within budget.

→ No single strategy wins — production systems summarize old history, retrieve fresh chunks, and always protect system prompt and query.

Python Example

from typing import List

class ConversationManager:
    """Hybrid strategy: summarize old, keep recent, always protect system."""

    def __init__(self, window_size=128000, recent_turns=5):
        self.window_size = window_size
        self.recent_turns = recent_turns
        self.system_prompt = ""
        self.summary = ""       # Compressed old history
        self.messages = []      # Full message list

    def build_context(self, query: str, rag_chunks: List[str]) -> List[dict]:
        """Assemble context within budget."""
        output_reserve = 4096
        budget = self.window_size - output_reserve

        # 1. System prompt (always included)
        context = [{"role": "system", "content": self.system_prompt}]
        used = self.count_tokens(self.system_prompt)

        # Count the query up front so earlier components cannot crowd it out
        used += self.count_tokens(query)

        # 2. Summary of old history
        if self.summary:
            summary_text = f"Previous conversation summary: {self.summary}"
            context.append({"role": "system", "content": summary_text})
            used += self.count_tokens(summary_text)

        # 3. RAG chunks (assumed pre-sorted by relevance), capped at half the remaining budget
        rag_budget = int((budget - used) * 0.5)
        rag_used = 0
        for chunk in rag_chunks:
            ct = self.count_tokens(chunk)
            if rag_used + ct <= rag_budget:
                context.append({"role": "system", "content": chunk})
                rag_used += ct
        used += rag_used

        # 4. Recent turns (sliding window): newest first, stop when the budget is spent
        window = self.messages[-self.recent_turns * 2:] if self.recent_turns else []
        recent = []
        for msg in reversed(window):
            ct = self.count_tokens(msg["content"])
            if used + ct > budget:
                break
            recent.insert(0, msg)
            used += ct
        context.extend(recent)

        # 5. Current query (already counted above)
        context.append({"role": "user", "content": query})
        return context

    def count_tokens(self, text):
        return max(1, len(text) // 4)  # Approximation

Follow-up Questions

How do you size the overlap in a sliding window?

Overlap ensures continuity between window positions. A typical overlap is 10-20% of the window size. For conversation, this means keeping the last 1-2 turns from the previous window. For document chunking, 50-100 token overlap prevents splitting sentences or ideas. Too little overlap risks losing context at boundaries; too much wastes budget on redundancy.
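
For document chunking, the overlap arithmetic can be sketched as below. The function name `chunk_with_overlap` is hypothetical, and a list of integers stands in for real token IDs.

```python
from typing import List

def chunk_with_overlap(tokens: List[int], size: int, overlap: int) -> List[List[int]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

chunks = chunk_with_overlap(list(range(10)), size=4, overlap=1)
print(chunks)
```

With a window of 4 and an overlap of 1, each window repeats the last token of the one before it, so text that straddles a boundary appears intact in at least one window as long as it is no longer than the overlap.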

What is recursive summarization and when is it useful?

Recursive summarization summarizes a document in chunks, then summarizes those summaries, repeating until you reach a target length. It's useful for very long documents (books, legal filings) where a single-pass summary would itself exceed the context window. The risk is compounding information loss — each level of recursion discards more detail, so it works best for capturing themes rather than specific facts.

How does incremental summarization differ from batch summarization?

In incremental summarization, you update the summary after each new message: "Given this summary of conversation so far and this new message, produce an updated summary." This avoids re-processing the entire history but can drift over time. Batch summarization processes all messages at once for higher fidelity but requires the full history to fit in context. Incremental is better for real-time chat; batch is better for periodic compaction.
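
The incremental variant might look like the sketch below. `RollingSummary` is a hypothetical class name and `summarize` is a stand-in for an LLM call that takes a prompt and returns a summary. The prompt wording follows the example quoted above.

```python
from typing import Callable

class RollingSummary:
    """Keeps one running summary and folds each new message into it."""

    def __init__(self, summarize: Callable[[str], str]):
        self.summarize = summarize  # hypothetical LLM call: prompt -> summary text
        self.summary = ""

    def update(self, new_message: str) -> str:
        prompt = (
            f"Summary of conversation so far:\n{self.summary or '(empty)'}\n\n"
            f"New message:\n{new_message}\n\n"
            "Produce an updated summary."
        )
        # Only the previous summary and the new message are sent, never the full history
        self.summary = self.summarize(prompt)
        return self.summary
```

Since every update is built on the previous summary rather than the original messages, an error introduced at one step is carried into all later ones, which is the drift mentioned above.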

10

Token Budgeting in Production

Token budgeting means pre-allocating context window space to each component — system prompt, history, retrieved context, query, and output reserve — so your application never overflows in production. It is the difference between a demo and a reliable system.

💡 Budget tokens like money — pay the bills first (system prompt, output reserve), then allocate leftovers by priority.

Example allocation of the context window:

Component	Share of window
System Prompt	5%
History Summary	10%
RAG Chunks	35%
Recent Turns	20%
Query	5%
Output Reserve	25%

[Chart: 20-request simulation of token usage per request]

Budget Rules of Thumb

Before writing any application code, allocate your context window (see Topic 6: Context Window):

Fixed costs first: System prompt + output reserve. These are non-negotiable — always reserve them at their maximum expected size.
Query overhead: Leave room for the user's actual question (typically 1-5% of window).
Variable budget: What remains is split between history and retrieved context, with retrieved context usually taking priority.
Safety margin: Reserve 5-10% as buffer — token estimates are imprecise, and edge cases will surprise you.
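
The ordering of those rules can be made concrete with a small calculation: subtract the fixed costs, the query allowance, and the safety margin first, then split what remains. The function name `allocate`, the sample sizes, and the 70/30 split in favor of retrieved context are assumptions for illustration.

```python
def allocate(window: int, system_prompt: int, output_reserve: int,
             query: int, safety_frac: float = 0.05, rag_share: float = 0.7) -> dict:
    """Reserve fixed costs first, then split the remainder between RAG and history."""
    safety = round(window * safety_frac)
    variable = window - system_prompt - output_reserve - query - safety
    if variable < 0:
        raise ValueError("fixed costs exceed the context window")
    rag = round(variable * rag_share)
    return {
        "system_prompt": system_prompt,
        "output_reserve": output_reserve,
        "query": query,
        "safety_margin": safety,
        "rag_chunks": rag,
        "history": variable - rag,
    }

# Hypothetical sizes for a 128K window
plan = allocate(window=128_000, system_prompt=3_000, output_reserve=16_000, query=2_000)
for name, tokens in plan.items():
    print(f"{name:>15}: {tokens:>7,}")
```

Working from the remainder rather than from fixed percentages means a longer system prompt automatically shrinks the variable budget instead of silently overflowing it.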

Common Mistakes

Mistake	Consequence	Fix
No output reserve	Truncated responses	Always reserve 10-25% for output
Unbounded history	Context overflow on long conversations	Sliding window + summarization
Stuffing max context	Higher cost, "lost in middle" (see Topic 7: Cost & Latency)	Retrieve only relevant chunks
Ignoring token counting	Silent truncation by API	Count tokens before every call
Hardcoded budgets	Breaks on model/window changes	Percentage-based allocation

Monitoring in Production

Token budgeting doesn't end at design time. In production, you need to monitor (see Topic 9: Truncation vs Sliding Windows vs Summarization):

Utilization rate: How much of the window is used on average? If it's consistently >90%, you're one edge case from overflow.
Overflow frequency: How often do requests exceed the budget? Even 0.1% can mean hundreds of failed requests per day at scale.
Component distribution: Is one component (e.g., RAG chunks) dominating? Rebalance if so.
Cost per request: Track input + output tokens per request to catch regressions early.

→ Token budgeting is reliability engineering — overflow failures are as catastrophic as latency spikes and far less visible.

Python Example

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TokenBudget:
    """Manage token budget for production LLM calls."""
    window_size: int = 128000
    allocations: Dict[str, float] = field(default_factory=lambda: {
        "system_prompt":  0.05,
        "history_summary": 0.10,
        "rag_chunks":     0.35,
        "recent_turns":   0.15,
        "query":          0.05,
        "output_reserve": 0.25,
        "safety_margin":  0.05,
    })

    def get_budget(self, component: str) -> int:
        """Get token budget for a component."""
        return int(self.window_size * self.allocations[component])

    def validate(self, actuals: Dict[str, int]) -> Dict:
        """Check actual usage against budget."""
        warnings = []
        total_used = sum(actuals.values())

        # If the caller passed only input components, keep the output reserve free
        reserved = 0 if "output_reserve" in actuals else self.get_budget("output_reserve")

        for comp, tokens in actuals.items():
            budget = self.get_budget(comp)
            if tokens > budget:
                warnings.append(
                    f"{comp}: {tokens} tokens exceeds budget of {budget}"
                )

        return {
            "total_used": total_used,
            "window_size": self.window_size,
            "utilization": round(total_used / self.window_size, 3),
            "overflow": total_used + reserved > self.window_size,
            "warnings": warnings,
        }

# Usage
budget = TokenBudget(window_size=128000)
print("RAG budget:", budget.get_budget("rag_chunks"))  # 44800

result = budget.validate({
    "system_prompt": 3200,
    "history_summary": 8000,
    "rag_chunks": 50000,  # Over budget!
    "recent_turns": 12000,
    "query": 500,
})
print("Utilization:", result["utilization"])
print("Warnings:", result["warnings"])  # Lists the rag_chunks overage

Follow-up Questions

How should multi-turn agents handle growing context?

Multi-turn agents should use a tiered compression approach: keep recent tool calls and observations verbatim, summarize older turns progressively, and maintain a "scratchpad" of key decisions. The critical insight is that agent loops can run 20-50+ turns — without compression, you overflow within minutes. Set a hard budget per turn and trigger summarization when utilization exceeds 80%.

What metrics should I monitor for token budgeting in production?

Track four key metrics: p95 utilization (how close to the limit your busiest requests get), overflow rate (percentage of requests that exceed budget), cost per request (input + output tokens times price), and component breakdown (which budget category uses the most tokens). Set alerts when p95 utilization exceeds 85% or overflow rate exceeds 0.01%.
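
Two of these metrics, p95 utilization and overflow rate, can be computed from logged token counts as sketched below. The function names are hypothetical, the percentile uses the simple nearest-rank definition, and the 20 request sizes are made-up sample data.

```python
from typing import List

def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]

def budget_metrics(token_counts: List[int], window: int) -> dict:
    """Summarize how close logged requests came to the context limit."""
    utilizations = [t / window for t in token_counts]
    overflows = sum(1 for t in token_counts if t > window)
    return {
        "p95_utilization": p95(utilizations),
        "overflow_rate": overflows / len(token_counts),
    }

# Made-up sample: 20 requests of 1,000 to 20,000 tokens against an 18,000-token limit
counts = [1000 * i for i in range(1, 21)]
metrics = budget_metrics(counts, window=18_000)
print(metrics)
```

In practice these numbers would be computed over a rolling time window and fed into the alert thresholds described above.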

How does prompt caching interact with token budgeting?

Prompt caching reduces cost but not context window consumption — cached tokens still occupy window space. However, it changes the cost optimization calculus: a large, stable system prompt that would normally be a budgeting concern becomes cheap to send repeatedly. This means you can afford richer system prompts (instructions, examples, schemas) as long as they stay constant across requests, since cached prefixes cost 50-90% less.

What is graceful degradation for token overflow?

Graceful degradation means having fallback strategies when the primary budget is exceeded. A typical chain: first, reduce RAG chunks (drop lowest-relevance); second, compress history more aggressively; third, truncate retrieved context; finally, if still over budget, return a helpful error explaining the limitation. Never silently truncate — always log the event and degrade in a predictable, prioritized order.
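
One way to express such a chain is as an ordered list of fallback steps that are applied only while the request is still over budget, as in the sketch below. The request layout (a list of `(relevance, tokens)` pairs plus a history token count), the step functions, and the name `degrade` are all assumptions made for illustration.

```python
import logging
from typing import Callable, List, Tuple

log = logging.getLogger("token_budget")

def degrade(request: dict, budget: int, count: Callable[[dict], int],
            steps: List[Tuple[str, Callable[[dict], dict]]]) -> dict:
    """Apply fallback steps in priority order until the request fits the budget."""
    for name, step in steps:
        if count(request) <= budget:
            break
        # Never degrade silently: record what was done and why
        log.warning("over budget (%d > %d); applying: %s", count(request), budget, name)
        request = step(request)
    if count(request) > budget:
        raise RuntimeError("request still exceeds the token budget after all fallbacks")
    return request

# Toy request: RAG chunks as (relevance, tokens) pairs plus a history token count
def count(req: dict) -> int:
    return sum(tokens for _, tokens in req["rag"]) + req["history"]

def drop_weakest_chunk(req: dict) -> dict:
    return {**req, "rag": sorted(req["rag"], reverse=True)[:-1]}

def halve_history(req: dict) -> dict:
    return {**req, "history": req["history"] // 2}

steps = [("drop weakest RAG chunk", drop_weakest_chunk),
         ("compress history", halve_history)]
fitted = degrade({"rag": [(0.9, 300), (0.2, 400)], "history": 600},
                 budget=700, count=count, steps=steps)
print(fitted)
```

The caller can catch the final exception and turn it into the user-facing error message, so the failure is explicit rather than a silently truncated prompt.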