What tokens are, how vocabularies are built, and the structural markers that shape model behavior.
What Is a Token?
From Text to Numbers
Large language models cannot read text directly. Before a single word reaches the neural network, a tokenizer splits the input into a sequence of tokens — integer IDs that map into the model's learned vocabulary. Each token ID points to a row in an embedding matrix, producing a dense vector the model can actually compute with.
The vocabulary is fixed at training time. GPT-4's tokenizer (cl100k_base) has roughly 100,000 entries; Claude's is similar in scale. Every string you send is decomposed into a sequence drawn from that finite set.
Token Economy
Tokens are the billing unit of every major LLM API. When a provider quotes "$3 per million input tokens," they mean the tokenized length — not characters, not words. Because different text types tokenize at different densities, the same number of characters can cost wildly different amounts:
- English prose: ~1 token per 4 characters
- Python code: ~1 token per 3 characters (more whitespace, symbols)
- JSON with long keys: ~1 token per 3.5 characters
- CJK text: ~1 token per 1.5 characters (each character often becomes its own token)
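The ratios above can be turned into a rough estimator. The sketch below is illustrative only: the `CHARS_PER_TOKEN` table restates the approximate figures from the list, and `estimate_tokens` and `estimate_input_cost` are hypothetical helper names, not part of any library. When exact counts matter, run a real tokenizer instead.

```python
import math

# Approximate characters per token, restating the rough figures above.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "python": 3.0,
    "json": 3.5,
    "cjk": 1.5,
}

def estimate_tokens(num_chars: int, text_type: str) -> int:
    """Rough token estimate from a character count (rounded up)."""
    return math.ceil(num_chars / CHARS_PER_TOKEN[text_type])

def estimate_input_cost(num_chars: int, text_type: str,
                        usd_per_million: float) -> float:
    """Rough input cost in dollars at a given per-million-token rate."""
    tokens = estimate_tokens(num_chars, text_type)
    return tokens * usd_per_million / 1_000_000

# Same character count, different text types, same $3 per million rate.
for kind in CHARS_PER_TOKEN:
    tokens = estimate_tokens(12_000, kind)
    cost = estimate_input_cost(12_000, kind, usd_per_million=3.0)
    print(f"{kind:>8}: ~{tokens} tokens, ~${cost:.4f}")
```

The same 12,000 characters come out at very different token counts depending on the text type, which is the point of the list above.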
Why This Matters
Understanding tokens unlocks three practical skills:
- Cost estimation: You can predict API spend before sending a request.
- Prompt engineering: You know exactly how much room you have inside a context window (see Topic 7: Context Windows).
- Debugging odd behavior: Spelling errors, hallucinated words, and strange code completions often trace back to how the tokenizer split the input. See Topic 2: Tokens vs Words for the word/token mismatch that causes most surprises.
Python Example
import tiktoken
# Load the tokenizer used by GPT-4 / ChatGPT
enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)}")
# Decode each token back to its string piece
for tid in tokens:
    print(f" {tid:>6} -> {enc.decode([tid])!r}")
How big is a typical vocabulary, and why not just use one token per character?
Do different models use different tokenizers?
What happens when the model encounters a word it has never seen?
Does whitespace count as tokens?
Tokens vs Words
Why the Mismatch
Words are a human concept with fuzzy boundaries. Is "don't" one word or two? Is "state-of-the-art" one word or four? Tokenizers don't care about these debates. They split text according to statistical patterns learned during training via algorithms like Byte-Pair Encoding (see Topic 3: Byte-Pair Encoding).
Common words like "the", "is", and "hello" usually map to a single token. But longer or rarer words get split into subword pieces. "Uncharacteristically" might become four tokens: ["Un", "character", "istic", "ally"]. Meanwhile, frequent multi-character sequences like " the" (with a leading space) or "\n\n" are single tokens.
Fertility: The Hidden Cost Multiplier
Fertility is the average number of tokens per word. English prose typically has a fertility around 1.2–1.4, meaning 100 words become roughly 120–140 tokens. But this ratio shifts dramatically:
| Input Type | Typical Fertility | Why |
|---|---|---|
| Simple English | 1.1–1.3 | Common words are single tokens |
| Technical English | 1.4–1.8 | Domain jargon splits into subwords |
| Python code | 1.8–2.5 | Symbols, indentation, identifiers |
| JSON | 2.0–3.0 | Brackets, colons, quoted keys |
| German | 2.0–3.5 | Long compound words |
| Korean / Thai | 3.0–5.0+ | Each syllable or character may be its own token |
This has direct cost implications. Sending the same semantic content in Korean can cost 3–4x more in tokens than English. For multilingual applications, fertility analysis is essential for budgeting. See Topic 7: Context Windows for how fertility impacts context budgets.
Python Example
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Code": "def fibonacci(n):\n    return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "JSON": '{"users": [{"name": "Alice", "age": 30}]}',
}
for label, text in samples.items():
    words = text.split()  # whitespace-delimited "words"
    tokens = enc.encode(text)
    fertility = len(tokens) / len(words)
    print(f"{label:>8}: {len(words)} words -> {len(tokens)} tokens (fertility {fertility:.2f})")
Does this mean non-English languages are more expensive to use with LLMs?
Can I write prompts that use fewer tokens?
Should I chunk text by tokens instead of by words or characters?
Byte-Pair Encoding (BPE)
The UNK Problem
Early NLP systems used fixed word-level vocabularies. Any word not in the vocabulary became <UNK> (unknown). Even with a 50,000-word vocabulary, common misspellings, new slang, or German compound words would vanish into UNK. BPE addresses this by decomposing any string into subword pieces. With byte-level BPE, which can fall back to individual bytes, there is no such thing as an unknown input.
The Algorithm Step by Step
1. Initialize: Start with a vocabulary of all individual characters (or bytes) found in the training corpus.
2. Count pairs: Scan all tokens in the corpus and count every adjacent pair (bigram).
3. Merge the top pair: Take the most frequent pair and merge it into a single new token. Add this token to the vocabulary.
4. Repeat: Go back to step 2. Continue until you reach the target vocabulary size (e.g., 50,000 merges).
The merge rules are saved in order. At inference time, the tokenizer applies the same merges in the same order to any new text, guaranteeing deterministic tokenization.
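To make the inference-time step concrete, here is a minimal sketch of replaying saved merges on new words. The `apply_merges` helper and the four-rule merge list are hypothetical, chosen only for illustration. Production tokenizers use faster rank-based lookups, but the idea is the same: the same ordered rules always yield the same split.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying ordered merge rules."""
    # Start from characters plus an end-of-word marker.
    symbols = list(word) + ["</w>"]
    for left, right in merges:  # rules are applied in training order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# Hypothetical merge list, in the order a trained tokenizer saved it.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(apply_merges("low", merges))    # ['low', '</w>']
print(apply_merges("lower", merges))  # ['low', 'er</w>']
print(apply_merges("slow", merges))   # ['s', 'low', '</w>']
```

Note that "slow" never appeared in the toy rules, yet it still tokenizes: unmatched characters simply stay as single-symbol tokens.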
Vocabulary Size Tradeoffs
| Vocab Size | Pros | Cons |
|---|---|---|
| Small (~8K) | Compact model, fewer parameters in embedding layer | Longer sequences, more tokens per word |
| Medium (~32K) | Good balance for most languages | May still split technical terms |
| Large (~100K+) | Shorter sequences, common phrases as single tokens | Larger embedding matrix, sparser training signal per token |
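The "larger embedding matrix" cost in the table is simple arithmetic: one row per vocabulary entry, one column per hidden dimension. The sketch below assumes a hidden size of 4,096 purely for illustration; real models vary, and many also have a separate output projection of the same shape.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model

D_MODEL = 4096  # assumed hidden size, for illustration only

for vocab in (8_000, 32_000, 100_000):
    params = embedding_params(vocab, D_MODEL)
    print(f"vocab {vocab:>7,}: {params / 1e6:,.1f}M embedding parameters")
```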
Which Models Use BPE?
BPE (and its variant byte-level BPE) is used by GPT-2, GPT-3, GPT-4, Claude, LLaMA, Mistral, and most modern LLMs. The main alternative is SentencePiece (see Topic 4: SentencePiece), which uses either BPE or unigram mode with a different pre-tokenization approach.
Python Example
# Simplified BPE training loop
from collections import Counter
def train_bpe(corpus, num_merges):
    # Split words into character lists, with an end-of-word marker
    words = [list(w) + ['</w>'] for w in corpus.split()]
    merges = []
    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = Counter()
        for word in words:
            for j in range(len(word) - 1):
                pairs[(word[j], word[j+1])] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        print(f"Merge {i+1}: {best[0]} + {best[1]} -> {best[0]+best[1]}")
        # Apply merge to all words
        for word in words:
            j = 0
            while j < len(word) - 1:
                if (word[j], word[j+1]) == best:
                    word[j:j+2] = [best[0] + best[1]]
                else:
                    j += 1
    return merges
merges = train_bpe("low lower newest widest low low", 10)
What is byte-level BPE, and how does it differ from character-level BPE?
Does a larger vocabulary always produce better results?
How does BPE differ from the unigram model used in SentencePiece?
Can BPE merges cross word boundaries?
SentencePiece
Why Whitespace Assumptions Break
Traditional NLP pipelines assume words are separated by spaces. This works for English, French, and German, but fails catastrophically for:
- Japanese: "私は学生です" has no spaces at all
- Thai: "สวัสดีครับ" uses spaces only between sentences, not words
- Chinese: Characters map to morphemes, not words — word boundaries are ambiguous
Building a separate tokenizer for each language is expensive, error-prone, and creates a maintenance nightmare. SentencePiece eliminates this by treating all text as a sequence of Unicode characters (or bytes), with whitespace represented as a special character (▁) rather than used as a delimiter. See Topic 3: Byte-Pair Encoding for the underlying merge algorithm.
How SentencePiece Works
- Normalize: Apply NFKC Unicode normalization to canonicalize characters (e.g., full-width "A" becomes normal "A").
- Escape whitespace: Replace spaces with the meta-symbol ▁ (U+2581). The input becomes one continuous string.
- Train or apply: Use either BPE (bottom-up merging) or the unigram language model (top-down pruning) to build or apply the vocabulary.
- Output: A sequence of token IDs and a single .model file that works for any language.
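The first two steps can be sketched with the standard library alone. This is an approximation for illustration, not SentencePiece's actual code: the real library applies its own normalization rules and, by default, also adds a leading ▁ to the input. The `preprocess` and `postprocess` names are hypothetical.

```python
import unicodedata

META = "\u2581"  # the "▁" meta-symbol that stands in for a space

def preprocess(text: str) -> str:
    """NFKC-normalize, then replace spaces with the meta-symbol."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace(" ", META)

def postprocess(pieces) -> str:
    """Invert the escaping: join pieces and restore spaces."""
    return "".join(pieces).replace(META, " ")

# A full-width "Ｈ" is canonicalized to a plain "H".
escaped = preprocess("\uff28ello world")
print(escaped)                                   # Hello▁world
print(postprocess(["Hello", META + "wor", "ld"]))  # Hello world
```

Because whitespace survives as an ordinary symbol, decoding is just concatenation plus one replacement, with no language-specific detokenization rules.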
BPE vs Unigram Mode
| Feature | BPE Mode | Unigram Mode |
|---|---|---|
| Direction | Bottom-up (merge pairs) | Top-down (prune vocabulary) |
| Determinism | One tokenization per input | Multiple possible tokenizations, picks best by likelihood |
| Regularization | Not built-in | Subword regularization (sample different tokenizations during training) |
| Used by | LLaMA, Mistral | T5, mBART, ALBERT |
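The "picks best by likelihood" row can be illustrated with a small Viterbi search. The vocabulary and probabilities below are made up for the example, and `best_segmentation` is a hypothetical helper; a real unigram model learns its probabilities from data and prunes the vocabulary during training.

```python
import math

# Hypothetical unigram vocabulary with made-up probabilities.
VOCAB = {
    "un": 0.05, "happy": 0.04, "unhappy": 0.001,
    "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01, "p": 0.01, "y": 0.01,
}

def best_segmentation(text, vocab):
    """Viterbi search for the highest-likelihood split of text."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob ending at each index
    best_score[0] = 0.0
    back = [0] * (n + 1)                # start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best_score[start] > -math.inf:
                score = best_score[start] + math.log(vocab[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

pieces, score = best_segmentation("unhappy", VOCAB)
print(pieces, round(score, 3))
```

With these toy numbers, "un" + "happy" (0.05 × 0.04 = 0.002) beats the single piece "unhappy" (0.001), so the two-piece split is chosen. Sampling from the other candidates instead of always taking the best is what subword regularization does.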
Python Example
import sentencepiece as spm
# Train a SentencePiece model on a text file
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=8000,
    model_type='bpe',           # or 'unigram'
    character_coverage=0.9995,  # cover 99.95% of characters
)
# Load and use
sp = spm.SentencePieceProcessor(model_file='my_tokenizer.model')
text = "SentencePiece works for any language!"
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)
print(f"Pieces: {pieces}")
print(f"IDs: {ids}")
print(f"Decoded: {sp.decode(ids)}")
How does SentencePiece compare to tiktoken (used by OpenAI)?
What is NFKC normalization, and can it cause problems?
Can I add custom pre-tokenization rules to SentencePiece?
Special Tokens
Token Reference Table
| Token | Name | Purpose | Used By |
|---|---|---|---|
| <|begin_of_text|> | BOS | Marks the very start of the input sequence | LLaMA 3 (LLaMA 2 and Mistral use <s>) |
| <|end_of_text|> | EOS | Signals the model to stop generating | LLaMA 3 (GPT models use <|endoftext|>) |
| <|im_start|> | Role start | Begins a new message with a role (system/user/assistant) | ChatML format |
| <|im_end|> | Role end | Ends the current message | ChatML format |
| [INST] / [/INST] | Instruction | Wraps user instructions | LLaMA 2, Mistral |
| <|pad|> | Padding | Fills unused positions in fixed-length batches | Varies by model (e.g., [PAD] in BERT) |
| [SEP] | Separator | Separates segments (e.g., document from query) | BERT |
Why Fine-Tuning Goes Wrong
The most common fine-tuning mistake is getting special tokens wrong. If your training data uses a different chat template than the base model expects, the model cannot tell where one message ends and another begins. Symptoms include:
- The model echoes the prompt back instead of responding
- It generates text attributed to the wrong role
- It refuses to stop generating (missing EOS)
- It produces garbled output at message boundaries
Always verify that your fine-tuning data uses the exact same special token format as the base model's chat template.
Chat Templates
Each model family defines a chat template — the exact sequence of special tokens that wraps each message. The Hugging Face transformers library stores these as Jinja2 templates in the tokenizer config. When you call tokenizer.apply_chat_template(), it handles the formatting automatically.
Getting this right is essential. The same model will behave completely differently depending on whether special tokens are correctly placed. See Topic 1: What Is a Token? for how these special tokens are just integer IDs in the same vocabulary as regular tokens.
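To see what a chat template actually produces, here is a hand-rolled sketch of the ChatML layout. The `render_chatml` function is hypothetical and written only for illustration; exact whitespace and defaults differ between model families, so in practice you should rely on the tokenizer's own template rather than string formatting like this.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Format messages in a ChatML-style layout (illustrative only)."""
    parts = []
    for msg in messages:
        # Each message is wrapped in role-start and role-end markers.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should respond.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are special tokens?"},
]
print(render_chatml(messages))
```

If training data were rendered without the closing `<|im_end|>` marker, or with a different role marker, the model would see a token sequence it was never trained on, which is where the symptoms listed above come from.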
Python Example
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are special tokens?"},
    {"role": "assistant", "content": "Special tokens are..."},
]
# apply_chat_template handles all special tokens automatically
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(formatted)
# Inspect the special tokens in the vocabulary
print("Special tokens:", tokenizer.all_special_tokens)
print("Special IDs:", tokenizer.all_special_ids)
How can I debug special token issues in my prompts?
Encode the prompt with tokenizer.encode() and decode each token individually to see the exact sequence. Look for missing or duplicated BOS/EOS tokens, or role markers that don't match the model's expected template. Most issues become obvious when you inspect the raw token ID sequence.
What is ChatML, and why does it matter?
ChatML is a chat format that uses <|im_start|> and <|im_end|> tokens to delimit messages. Many open-source models (Qwen, Yi, OpenChat) adopted it, making it a de facto standard. If you fine-tune on ChatML data but deploy with a different template, the model will not understand role boundaries.
Do special tokens affect the attention mask?
Now that you understand how tokens work, this section covers the engineering constraints they create — context limits, cost, overflow handling, and production budgeting.
Context Window
Attention Scope, Not Memory
Despite what the name suggests, the context window is not memory in any durable sense. It is the attention scope: the set of tokens the model's self-attention layers can attend to when generating the next token. Once a conversation ends or a token falls outside the window, it is gone. The model has no mechanism to "remember" it unless you re-inject the information.
This is fundamentally different from how humans process information. We forget details but retain gist indefinitely. An LLM forgets nothing within the window and everything outside it.
The Math: 128K ≈ 96K Words
A common rule of thumb is 1 token ≈ 0.75 words in English (see Topic 1: What Is a Token?). So a 128K context window holds roughly 96,000 words, about the length of a novel. That sounds like a lot, but production prompts fill up fast:
| Component | Typical Size |
|---|---|
| System prompt | 500–4,000 tokens |
| Conversation history (10 turns) | 2,000–20,000 tokens |
| RAG chunks (5 documents) | 5,000–50,000 tokens |
| User query | 50–2,000 tokens |
| Output reserve | 4,096–16,384 tokens |
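Adding up the table shows how quickly a window fills. The sketch below simply restates the table's low and high ends and the 0.75 words-per-token rule of thumb; the component names are labels chosen for this example.

```python
WINDOW = 128_000
WORDS_PER_TOKEN = 0.75  # rule of thumb for English

# (low, high) token ranges, restating the table above.
COMPONENTS = {
    "system_prompt": (500, 4_000),
    "history_10_turns": (2_000, 20_000),
    "rag_chunks_5_docs": (5_000, 50_000),
    "user_query": (50, 2_000),
    "output_reserve": (4_096, 16_384),
}

low = sum(lo for lo, _ in COMPONENTS.values())
high = sum(hi for _, hi in COMPONENTS.values())

print(f"Window holds ~{int(WINDOW * WORDS_PER_TOKEN):,} words")
print(f"Typical usage: {low:,} to {high:,} tokens "
      f"({low / WINDOW:.0%} to {high / WINDOW:.0%} of the window)")
```

At the high end, a single request with these components already uses roughly 72% of a 128K window.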
Output Eats the Same Budget
A detail often missed by beginners: generated output tokens consume context window space. If you have 128K tokens and your input uses 120K, the model can only produce 8K tokens of output before hitting the limit. Many APIs enforce a separate max_tokens parameter, but the sum of input + output can never exceed the window.
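The shared budget reduces to a single line of arithmetic. In this sketch, `max_output_tokens` is a hypothetical helper name, and `max_tokens` stands for whatever output cap your API call sets.

```python
def max_output_tokens(window: int, input_tokens: int, max_tokens: int) -> int:
    """Output the model can actually produce: the cap or the space left, whichever is smaller."""
    remaining = max(0, window - input_tokens)
    return min(max_tokens, remaining)

# 120K of input in a 128K window leaves only 8K for output,
# even if max_tokens asks for more.
print(max_output_tokens(window=128_000, input_tokens=120_000, max_tokens=16_384))  # 8000
print(max_output_tokens(window=128_000, input_tokens=10_000, max_tokens=4_096))    # 4096
```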
Lost in the Middle
Research shows that LLMs attend most strongly to information at the beginning and end of the context window. Information buried in the middle is more likely to be ignored, which is known as the "lost in the middle" effect. This means context window management isn't just about fitting tokens, but about positioning them strategically.
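One common way to act on this is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. The sketch below is one possible heuristic, not a standard algorithm, and `reorder_for_edges` is a hypothetical name. It assumes the input list is already sorted from most to least relevant.

```python
def reorder_for_edges(chunks_by_relevance):
    """Put the most relevant chunks at the start and end, weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
print(reorder_for_edges(ranked))    # ['A', 'C', 'E', 'D', 'B']
```

The two best chunks ("A" and "B") end up first and last, while the weakest chunk ("E") lands in the middle, where being overlooked costs the least.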
Longer ≠ Better
Larger context windows bring diminishing returns and real costs. Attention computation scales O(n²) with sequence length, so doubling the context quadruples attention cost. Adding irrelevant context can actually decrease accuracy by diluting signal. Smart retrieval (see Topic 10: Token Budgeting in Production) consistently outperforms brute-force context stuffing.
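The quadratic claim is easy to check numerically. This sketch models only the n² attention term relative to an assumed 8K baseline; it ignores the parts of the model that scale linearly and any attention optimizations a provider may use, so treat it as a rough upper-bound intuition rather than a latency predictor.

```python
def relative_attention_cost(n_tokens: int, baseline: int = 8_000) -> float:
    """Attention cost relative to a baseline length, assuming pure O(n^2) scaling."""
    return (n_tokens / baseline) ** 2

for n in (8_000, 16_000, 32_000, 128_000):
    print(f"{n:>7,} tokens: {relative_attention_cost(n):,.0f}x baseline attention cost")
```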
Python Example
import tiktoken
def check_context_budget(system, history, documents, query,
                         model="gpt-4", max_output=4096):
    """Check whether a prompt fits the context window."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # tiktoken only knows OpenAI models; fall back to an approximation
        enc = tiktoken.get_encoding("cl100k_base")
    window_sizes = {
        "gpt-4": 8192,
        "gpt-4-turbo": 128000,
        "gpt-4o": 128000,
        "claude-3-opus": 200000,
    }
    window = window_sizes.get(model, 8192)
    parts = {
        "system": len(enc.encode(system)),
        "history": sum(len(enc.encode(m)) for m in history),
        "documents": sum(len(enc.encode(d)) for d in documents),
        "query": len(enc.encode(query)),
        "output_reserve": max_output,
    }
    total = sum(parts.values())
    remaining = window - total
    return {
        "parts": parts,
        "total": total,
        "window": window,
        "remaining": remaining,
        "fits": remaining >= 0,
    }

# Usage
result = check_context_budget(
    system="You are a helpful assistant...",
    history=["Hello", "Hi! How can I help?"],
    documents=["Doc content here..."],
    query="Summarize the document",
)
print(f"Fits: {result['fits']}, Remaining: {result['remaining']}")
What exactly is the "lost in the middle" effect and how bad is it?
How does context length affect latency?
What is the difference between effective and theoretical context length?
How do different models' context windows compare?
Cost & Latency
Where the Money Goes
LLM APIs charge per token, with separate rates for input (prompt) and output (completion). Here's a comparison across popular models (see Topic 2: Tokenizer Algorithms for how tokens are counted):
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Ratio |
|---|---|---|---|
| GPT-3.5 Turbo | $0.50 | $1.50 | 3x |
| GPT-4o | $2.50 | $10.00 | 4x |
| GPT-4 Turbo | $10.00 | $30.00 | 3x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5x |
| Claude 3 Opus | $15.00 | $75.00 | 5x |
Why Output Costs More
Input tokens are processed in parallel through the transformer: the entire prompt is evaluated in one forward pass. Output tokens, however, are generated autoregressively: one at a time, each requiring a full forward pass through the model. This sequential generation is far more compute-intensive per token, which is why providers charge a premium.
Additionally, each output token must attend to all previous tokens (input + already-generated output), so the cost per token increases as the response gets longer.
Token-Efficient Prompting
The most impactful cost optimization is controlling output length. Techniques include:
- Constrained output formats: Ask for JSON, CSV, or structured data instead of prose
- Max token limits: Set max_tokens to a reasonable ceiling
- Explicit length instructions: "Answer in 2-3 sentences" or "List the top 5 only"
- System prompt optimization: Remove redundant instructions, use concise phrasing
- Prompt caching: Many providers cache repeated prompt prefixes at reduced rates (see Topic 10: Token Budgeting in Production)
Python Example
import tiktoken
# Pricing per 1M tokens (input, output)
PRICING = {
    "gpt-3.5-turbo": (0.50, 1.50),
    "gpt-4o": (2.50, 10.00),
    "gpt-4-turbo": (10.00, 30.00),
}

def estimate_cost(prompt, expected_output_tokens, model="gpt-4o"):
    """Estimate API call cost in dollars."""
    enc = tiktoken.encoding_for_model(model)
    input_tokens = len(enc.encode(prompt))
    in_rate, out_rate = PRICING[model]
    input_cost = input_tokens * in_rate / 1_000_000
    output_cost = expected_output_tokens * out_rate / 1_000_000
    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "input_cost": round(input_cost, 6),
        "output_cost": round(output_cost, 6),
        "total_cost": round(input_cost + output_cost, 6),
    }

# Compare verbose vs concise prompting
verbose = "Please provide a very detailed and comprehensive analysis..."
concise = "Analyze briefly in JSON: {sentiment, topics, action_items}"
print("Verbose:", estimate_cost(verbose, 2000))
print("Concise:", estimate_cost(concise, 200))
How does prompt caching reduce costs?
What are the best strategies for reducing output token usage?
Set max_tokens limits, ask for concise formats ("bullet points, not paragraphs"), and use function calling, which constrains output to a schema. In production, structured output can reduce output tokens by 60-80% compared to free-form responses.
Does batching requests save money?
Why is the output-to-input price ratio so high?
Exceeding the Context
Why Bigger Windows Don't Solve It
It's tempting to think that ever-larger context windows will eliminate overflow problems. They won't, for three reasons:
- Cost scales linearly with input size β filling a 200K window costs 25x more than filling 8K
- Latency scales quadratically β attention computation grows O(n²) with context length
- Quality degrades β the "lost in the middle" effect (see Topic 6: Context Window) means irrelevant context actively hurts accuracy
Even with a 1M-token window, the question is never "can it fit?" but "should it fit?"
Three Strategies
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Truncation | Cut content beyond limit | Simplest to implement | Loses recent/tail content blindly |
| Sliding Window | Keep most recent N tokens | Preserves recency | Loses early context; risks dropping the system prompt |
| Smart Retrieval | Embed & retrieve relevant chunks | Best accuracy, query-aware | Requires vector store infrastructure |
Retrieval-Based Selection
The most effective approach is retrieval-augmented generation (RAG): embed all documents into a vector store, then at query time retrieve only the most semantically relevant chunks. This approach (see Topic 9: Truncation vs Sliding Windows vs Summarization) ensures every token in the context is working toward answering the user's question.
Key implementation details:
- Chunk documents into 200-500 token segments with overlap
- Embed chunks using a model like text-embedding-3-small
- At query time, retrieve top-k chunks by cosine similarity
- Always reserve space for system prompt + query + output
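The chunking and retrieval steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: `chunk_tokens`, `cosine`, and `top_k` are hypothetical helpers, the vectors are hand-written stand-ins for real embeddings, and a production system would use an embedding model and a vector store instead of a linear scan.

```python
import math

def chunk_tokens(tokens, size=300, overlap=50):
    """Split a token list into fixed-size chunks that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks most similar to the query."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [idx for idx, _ in scored[:k]]

chunks = chunk_tokens(list(range(700)), size=300, overlap=50)
print([len(c) for c in chunks])                       # [300, 300, 200]
print(top_k([1.0, 0.0], [[0, 1], [1, 0], [1, 1]], k=2))  # [1, 2]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the price of storing some tokens twice.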
Python Example
from dataclasses import dataclass
@dataclass
class Chunk:
    text: str
    tokens: int
    relevance: float  # 0-1 similarity score

def select_chunks(chunks, budget, strategy="retrieval"):
    """Select chunks that fit within token budget."""
    if strategy == "truncate":
        # Take chunks in order until budget exhausted
        selected, used = [], 0
        for c in chunks:
            if used + c.tokens <= budget:
                selected.append(c)
                used += c.tokens
        return selected
    elif strategy == "sliding":
        # Take most recent chunks first, preserving original order
        selected, used = [], 0
        for c in reversed(chunks):
            if used + c.tokens <= budget:
                selected.insert(0, c)
                used += c.tokens
        return selected
    elif strategy == "retrieval":
        # Sort by relevance, pack greedily
        ranked = sorted(chunks, key=lambda c: c.relevance, reverse=True)
        selected, used = [], 0
        for c in ranked:
            if used + c.tokens <= budget:
                selected.append(c)
                used += c.tokens
        return selected
    raise ValueError(f"unknown strategy: {strategy}")
When is truncation better than summarization?
How does hierarchical summarization work for long documents?
What is the map-reduce approach to context overflow?
Truncation vs Sliding Windows vs Summarization
Strategy Comparison
| Dimension | Truncation | Sliding Window | Summarization |
|---|---|---|---|
| Complexity | Trivial | Low | Medium-High |
| Preserves recency | No (keeps oldest) | Yes | Partially |
| Preserves exact text | Yes (what's kept) | Yes (what's visible) | No |
| Extra LLM calls | 0 | 0 | 1+ per summary |
| Info loss pattern | Total loss of tail | Total loss of head | Distributed lossy compression |
| Best for | Simple queries, first-pass | Chat/dialogue | Long documents, multi-turn |
When to Use Each
- Truncation when the answer is likely near the beginning (e.g., abstracts, headers) and you need zero added latency
- Sliding Window for conversational history where recent turns matter most (see Topic 8: Exceeding the Context); most chat applications use this by default
- Summarization when you need the full document's gist: legal review, research synthesis, multi-turn agents that must remember early decisions
The Production Hybrid
Real production systems rarely use a single strategy. The winning pattern combines all three (see Topic 10: Token Budgeting in Production):
- System prompt: always kept verbatim (never truncated or summarized)
- Old conversation history: progressively summarized into compressed blocks
- Retrieved context: RAG chunks selected by relevance, truncated if individual chunks are too long
- Recent turns: kept verbatim in a sliding window of the last 3-5 exchanges
- User query + output reserve: always protected at full fidelity
This hybrid ensures that the model has structure (system prompt), gist (summaries), evidence (RAG), and recency (sliding window), all within budget.
Python Example
from typing import List
class ConversationManager:
    """Hybrid strategy: summarize old, keep recent, always protect system."""

    def __init__(self, window_size=128000, recent_turns=5):
        self.window_size = window_size
        self.recent_turns = recent_turns
        self.system_prompt = ""
        self.summary = ""    # Compressed old history
        self.messages = []   # Full message list

    def build_context(self, query: str, rag_chunks: List[str]) -> List[dict]:
        """Assemble context within budget."""
        output_reserve = 4096
        # Reserve room for the output and the current query up front
        budget = self.window_size - output_reserve - self.count_tokens(query)

        # 1. System prompt (always included)
        context = [{"role": "system", "content": self.system_prompt}]
        used = self.count_tokens(self.system_prompt)

        # 2. Summary of old history
        if self.summary:
            context.append({"role": "system",
                            "content": f"Previous conversation summary: {self.summary}"})
            used += self.count_tokens(self.summary)

        # 3. RAG chunks (assumed pre-sorted by relevance), up to half of what is left
        rag_budget = int((budget - used) * 0.5)
        rag_used = 0
        for chunk in rag_chunks:
            ct = self.count_tokens(chunk)
            if rag_used + ct <= rag_budget:
                context.append({"role": "system", "content": chunk})
                rag_used += ct
        used += rag_used

        # 4. Recent turns (sliding window); drop the oldest until they fit
        recent = self.messages[-self.recent_turns * 2:]
        while recent and used + sum(self.count_tokens(m["content"]) for m in recent) > budget:
            recent = recent[1:]
        context.extend(recent)

        # 5. Current query
        context.append({"role": "user", "content": query})
        return context

    def count_tokens(self, text):
        return max(1, len(text) // 4)  # Approximation
How do you size the overlap in a sliding window?
What is recursive summarization and when is it useful?
How does incremental summarization differ from batch summarization?
Token Budgeting in Production
Budget Rules of Thumb
Before writing any application code, allocate your context window (see Topic 6: Context Window):
- Fixed costs first: System prompt + output reserve. These are non-negotiable: always reserve them at their maximum expected size.
- Query overhead: Leave room for the user's actual question (typically 1-5% of window).
- Variable budget: What remains is split between history and retrieved context, with retrieved context usually taking priority.
- Safety margin: Reserve 5-10% as buffer, because token estimates are imprecise and edge cases will surprise you.
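The allocation order above can be sketched as a small function. The share values are assumptions picked from within the ranges in the list (5% for the query, 10% for the safety margin), and the 60/40 split favoring retrieved context is one arbitrary way to give it priority; `allocate_budget` is a hypothetical helper, not an established API.

```python
def allocate_budget(window, system_prompt, output_reserve,
                    query_share=0.05, safety_share=0.10, retrieval_share=0.6):
    """Allocate a context window: fixed costs, query, safety margin, then the rest."""
    fixed = system_prompt + output_reserve          # 1. fixed costs first
    query = int(window * query_share)               # 2. query overhead
    safety = int(window * safety_share)             # 4. safety margin
    variable = window - fixed - query - safety      # 3. what remains
    if variable < 0:
        raise ValueError("fixed costs exceed the window")
    retrieval = int(variable * retrieval_share)     # retrieved context gets priority
    return {
        "system_prompt": system_prompt,
        "output_reserve": output_reserve,
        "query": query,
        "safety_margin": safety,
        "retrieved_context": retrieval,
        "history": variable - retrieval,
    }

plan = allocate_budget(window=128_000, system_prompt=4_000, output_reserve=16_384)
for name, tokens in plan.items():
    print(f"{name:>18}: {tokens:>7,}")
```

Because every figure is derived from the window size, switching to a model with a different window only requires changing one argument.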
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| No output reserve | Truncated responses | Always reserve 10-25% for output |
| Unbounded history | Context overflow on long conversations | Sliding window + summarization |
| Stuffing max context | Higher cost, "lost in middle" (see Topic 7: Cost & Latency) | Retrieve only relevant chunks |
| Ignoring token counting | Silent truncation by API | Count tokens before every call |
| Hardcoded budgets | Breaks on model/window changes | Percentage-based allocation |
Monitoring in Production
Token budgeting doesn't end at design time. In production, you need to monitor (see Topic 9: Truncation vs Sliding Windows vs Summarization):
- Utilization rate: How much of the window is used on average? If it's consistently >90%, you're one edge case from overflow.
- Overflow frequency: How often do requests exceed the budget? Even 0.1% can mean hundreds of failed requests per day at scale.
- Component distribution: Is one component (e.g., RAG chunks) dominating? Rebalance if so.
- Cost per request: Track input + output tokens per request to catch regressions early.
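The first three metrics can be computed from per-request token logs. The sketch below assumes each log record is a dict mapping a component name to tokens used; that record shape and the `summarize_usage` name are assumptions for illustration, and the sample numbers are made up.

```python
from statistics import mean

def summarize_usage(records, window):
    """Aggregate per-request token logs into monitoring metrics."""
    totals = [sum(r.values()) for r in records]
    overflow = sum(1 for t in totals if t > window)
    # Sum tokens per component across all requests.
    components = {}
    for r in records:
        for name, tokens in r.items():
            components[name] = components.get(name, 0) + tokens
    grand_total = sum(totals)
    return {
        "avg_utilization": mean([t / window for t in totals]),
        "overflow_rate": overflow / len(records),
        "component_share": {k: v / grand_total for k, v in components.items()},
    }

# Made-up logs against a tiny 1,000-token window, for illustration.
logs = [
    {"system": 100, "rag": 500, "output": 200},
    {"system": 100, "rag": 1100, "output": 200},  # this one overflows
]
report = summarize_usage(logs, window=1_000)
print(report)
```

Here the RAG component accounts for about 73% of all tokens and one of two requests overflows, which is the kind of imbalance the component-distribution metric is meant to surface.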
Python Example
from dataclasses import dataclass, field
from typing import Dict
@dataclass
class TokenBudget:
    """Manage token budget for production LLM calls."""
    window_size: int = 128000
    allocations: Dict[str, float] = field(default_factory=lambda: {
        "system_prompt": 0.05,
        "history_summary": 0.10,
        "rag_chunks": 0.35,
        "recent_turns": 0.15,
        "query": 0.05,
        "output_reserve": 0.25,
        "safety_margin": 0.05,
    })

    def get_budget(self, component: str) -> int:
        """Get token budget for a component."""
        return int(self.window_size * self.allocations[component])

    def validate(self, actuals: Dict[str, int]) -> Dict:
        """Check actual usage against budget."""
        warnings = []
        total_used = sum(actuals.values())
        for comp, tokens in actuals.items():
            budget = self.get_budget(comp)
            if tokens > budget:
                warnings.append(
                    f"{comp}: {tokens} tokens exceeds budget of {budget}"
                )
        return {
            "total_used": total_used,
            "window_size": self.window_size,
            "utilization": round(total_used / self.window_size, 3),
            "overflow": total_used > self.window_size,
            "warnings": warnings,
        }

# Usage
budget = TokenBudget(window_size=128000)
print("RAG budget:", budget.get_budget("rag_chunks"))  # 44800
result = budget.validate({
    "system_prompt": 3200,
    "history_summary": 8000,
    "rag_chunks": 50000,  # Over budget!
    "recent_turns": 12000,
    "query": 500,
})
print("Utilization:", result["utilization"])