Ch 6: Classification with Large Language Models

Approaches & Strategy

How LLMs perform classification, when to prompt vs fine-tune, and how to design label systems that actually work.

Generative LLM as Classifier

A generative LLM can classify by being prompted to map an input into one label from a defined set. Instead of a dedicated classifier head, it uses instruction-following and language understanding to produce the target class — often with a justification or structured output alongside.

💡 Using an LLM for classification is like asking an expert to read a document and stamp it with a category label. The expert can also explain their reasoning, but they are slower and more expensive than a purpose-built sorting machine.

Input

"This product is incredible, best purchase I've made all year!"

↓ LLM classifies ↓

Output

{"label": "positive", "confidence": "high", "reason": "Strongly positive language: 'incredible', 'best purchase'"}


How It Works

A generative LLM performs classification by being prompted with instructions that define the label set and expected output format. The model uses its language understanding to map the input text to one of the defined categories. This works especially well when:

Classes are described in natural language — The model can understand what "urgent" or "billing_issue" means from its pretraining.
Input is messy or unstructured — LLMs handle typos, slang, and mixed formats better than traditional classifiers.
Examples are scarce — No labeled training data is needed for zero-shot classification.

The Trade-Off

Generative classification is powerful but comes with costs. Compared to a dedicated classifier:

Factor	LLM Classifier	Dedicated Classifier
Setup time	Minutes (write a prompt)	Days-weeks (train model)
Per-prediction cost	Higher (API call)	Lower (small model)
Latency	100ms–2s	1–10ms
Output stability	Can vary between calls	Deterministic
Explainability	Can generate rationale	Feature importance only

For a detailed comparison of when to prompt versus fine-tune, see Topic 2: Prompting vs Fine-Tuning.

→ LLMs can classify through prompting alone — fast to set up and great for messy inputs. But they are slower, costlier, and less stable than dedicated classifiers, so the choice depends on your operational constraints.

Python — LLM-Based Classification

import json
from openai import OpenAI

client = OpenAI()

def classify_with_llm(text, labels):
    """Classify text into one of the given labels using an LLM."""
    prompt = f"""Classify the following text into exactly one of these labels:
{json.dumps(labels)}

Text: "{text}"

Respond with JSON: {{"label": "...", "reason": "..."}}"""

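    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # Reduces run-to-run variation; outputs can still differ slightly
        response_format={"type": "json_object"},  # Ask for syntactically valid JSON
    )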

    return json.loads(response.choices[0].message.content)

# Example usage
result = classify_with_llm(
    "This product is incredible, best purchase all year!",
    ["positive", "negative", "neutral"]
)
print(result)
# {"label": "positive", "reason": "Strongly positive language..."}

Follow-up Questions

How do you constrain the LLM to only output valid labels?

Use structured output modes (such as OpenAI's JSON mode, function calling, or schema-constrained structured outputs), explicitly list the valid labels in the prompt, add a validation step that rejects or retries invalid outputs, or use constrained decoding (for example, logit bias) to force the model to choose from a predefined set. JSON mode on its own only guarantees syntactically valid JSON, not that the label is in your set, so always validate the output programmatically. Never trust raw model output in production.
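The validation-and-retry step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `parse_label` and `classify_with_retry` are hypothetical helper names, `call_model` stands in for whatever function returns the model's raw text, and the valid labels are assumed to be lowercase strings.

```python
import json

def parse_label(raw_output, valid_labels):
    """Return the label from a JSON model response, or None if it is invalid."""
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None  # Not parseable as JSON at all
    if not isinstance(data, dict):
        return None  # Valid JSON, but not the object shape we asked for
    label = data.get("label")
    if not isinstance(label, str):
        return None
    normalized = label.strip().lower()  # Assumes valid_labels are lowercase
    return normalized if normalized in valid_labels else None

def classify_with_retry(text, valid_labels, call_model, max_attempts=3):
    """Call the model until it returns a valid label; give up with None."""
    for _ in range(max_attempts):
        label = parse_label(call_model(text), valid_labels)
        if label is not None:
            return label
    return None  # Caller decides what to do: escalate, map to "other", etc.

# Stand-in model (hypothetical) that fails once and then answers correctly
replies = iter(["Sure! The label is positive.", '{"label": "Positive"}'])
print(classify_with_retry("Great product", {"positive", "negative", "neutral"},
                          lambda text: next(replies)))
# positive
```

Returning `None` rather than a default label keeps the failure visible, so the caller can route it to review instead of silently mislabeling.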

Can the LLM classify in languages it was not primarily trained on?

Modern multilingual LLMs can classify in many languages, but performance degrades for low-resource languages. The model may misunderstand nuance, cultural context, or domain-specific terminology in those languages. Always evaluate on representative data in the target language before deploying.

What about using the LLM's log probabilities for classification?

Some APIs expose token-level log probabilities, which can be used to compute the probability the model assigns to each label token. This provides a more calibrated confidence signal than verbal confidence statements. However, not all providers expose logprobs, and the technique requires careful prompt engineering to work reliably.
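As a rough sketch of the idea, the log probabilities for the candidate label tokens can be renormalized into a distribution over labels. The numbers below are made up for illustration, `label_probabilities` is a hypothetical helper, and the sketch assumes each label is identified by a single distinct token whose log probability you have already retrieved from the provider.

```python
import math

def label_probabilities(label_logprobs):
    """Renormalize per-label log probabilities into a distribution over labels."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(label_logprobs.values())
    exp_scores = {label: math.exp(lp - max_lp) for label, lp in label_logprobs.items()}
    total = sum(exp_scores.values())
    return {label: score / total for label, score in exp_scores.items()}

# Hypothetical log probabilities for the first token of each label
logprobs = {"positive": -0.105, "negative": -2.996, "neutral": -3.507}
probs = label_probabilities(logprobs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Renormalizing over only the allowed labels discards probability mass the model put on other tokens, so the result is a relative score among labels rather than a guaranteed calibrated probability.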

Prompting vs Fine-Tuning

Use prompting when the taxonomy changes often, labeled data is limited, and you need to move quickly. Use fine-tuning when labels are stable, volume is high, and you need tighter consistency and lower latency. Prompting buys flexibility; fine-tuning buys specialization.

💡 Prompting is like hiring a smart generalist and giving them a brief. Fine-tuning is like training a specialist for months. The generalist starts faster, but the specialist is better and cheaper at high volume.


Prompting

✓ Taxonomy changes frequently
✓ Limited labeled data
✓ Need explanations with labels
✓ Rapid prototyping phase
✗ Higher per-prediction cost
✗ Output can be inconsistent

Fine-Tuning

✓ Stable label set
✓ High prediction volume
✓ Low latency required
✓ Tight consistency needed
✗ Requires labeled training data
✗ Slower to iterate

The Decision Matrix

The prompting-versus-fine-tuning decision is not about which is "better" — it is about matching the approach to the operational reality:

Factor	Favor Prompting	Favor Fine-Tuning
Label stability	Labels change monthly	Labels fixed for 6+ months
Data availability	<100 labeled examples	1,000+ labeled examples
Prediction volume	<1,000/day	>10,000/day
Latency target	>500ms acceptable	<50ms required
Explanation needs	Rationale required	Label only
Team ML maturity	Low (prompt-first team)	High (ML ops capability)

The Lifecycle Pattern

Many successful teams follow a lifecycle: start with prompting to validate the task definition and build an evaluation set, then migrate to fine-tuning once the taxonomy stabilizes and volume justifies the investment. The prompting phase generates labeled data that feeds the fine-tuning phase — a natural flywheel. See Topic 3: Zero-Shot vs Few-Shot for the intermediate step.

→ Prompting buys flexibility and speed; fine-tuning buys consistency and cost efficiency. The best teams start with prompting to validate, then graduate to fine-tuning when the task stabilizes.

Python — Prompt-Based vs Fine-Tuned Classifier

# === Approach 1: Prompt-based classification ===
# Fast to set up, no training data needed
def prompt_classify(text, labels, generate):
    """Classify via a prompt. `generate` is any callable (hypothetical here)
    that takes a prompt string and returns the model's text completion."""
    prompt = f"Classify into one of {labels}: \"{text}\"\nLabel:"
    return generate(prompt).strip()

# === Approach 2: Fine-tuned classifier ===
# Requires labeled data, but faster and cheaper at scale
from transformers import pipeline

# After fine-tuning on your labeled dataset:
classifier = pipeline(
    "text-classification",
    model="./my-fine-tuned-classifier",
)

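# Inference: typically a few milliseconds on a GPU, and repeatable for a given input.
# Scores are not automatically calibrated; check them on held-out data.
result = classifier("The delivery was two weeks late.")
print(result)
# Illustrative output: [{'label': 'complaint', 'score': 0.94}]

# Illustrative cost comparison at 100K predictions/day
# (actual figures depend on model, prompt length, and hardware):
# Prompting: ~$50-200/day (API costs)
# Fine-tuned: ~$5-10/day (self-hosted GPU)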

Follow-up Questions

Can you fine-tune the LLM itself instead of training a separate classifier?

Yes. You can fine-tune the LLM on classification examples so it produces labels more reliably. This gives you the LLM's language understanding with better consistency. However, it is more expensive than fine-tuning a small BERT-style model and you may lose some of the LLM's general capabilities (catastrophic forgetting). LoRA fine-tuning is a practical middle ground.

What if I need explanations but also want low latency?

Use a two-stage pipeline: a fast fine-tuned classifier for the label, and an LLM call only when an explanation is requested (e.g., for audits or escalations). This keeps median latency low while preserving explainability where it matters. Alternatively, train the fine-tuned model to output both label and rationale.
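The two-stage pipeline above might look like the following sketch. All names are hypothetical: `fast_classifier` stands in for the fine-tuned model and `explain_fn` for the LLM call, and both are replaced by simple stand-ins so the example runs without a model or API key.

```python
def classify_with_optional_explanation(text, fast_classifier, explain_fn, explain=False):
    """Get the label from the fast model; call the LLM only on request."""
    label, score = fast_classifier(text)
    result = {"label": label, "score": score, "explanation": None}
    if explain:
        # The slow, expensive call happens only for audits or escalations.
        result["explanation"] = explain_fn(text, label)
    return result

# Stand-ins (not real models) so the sketch is runnable
def fake_fast_classifier(text):
    return ("complaint", 0.94) if "late" in text else ("other", 0.55)

def fake_explainer(text, label):
    return f"Labeled '{label}' based on the wording of: {text!r}"

message = "The delivery was two weeks late."
print(classify_with_optional_explanation(message, fake_fast_classifier, fake_explainer))
print(classify_with_optional_explanation(message, fake_fast_classifier, fake_explainer,
                                         explain=True))
```

Because the explanation is generated after the label is fixed, it is a post-hoc rationale for the fast model's decision, not a trace of how that model actually decided.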

How much labeled data do you need for fine-tuning to beat prompting?

It depends on the task complexity, but a common rule of thumb is 500-2,000 labeled examples for a fine-tuned model to match or exceed a well-crafted few-shot prompt. For very simple binary tasks, even 200 examples can suffice. For complex multi-class problems with subtle distinctions, you may need 5,000+.

Zero-Shot vs Few-Shot Classification

Zero-shot classification gives the model only label definitions or instructions. Few-shot also provides a handful of examples showing how inputs map to classes. Few-shot examples are particularly helpful when labels are subtle, overlapping, or organization-specific — they turn the prompt into a tiny on-the-fly training signal.

💡 Zero-shot is like telling a new employee "sort these into categories A, B, C" with no examples. Few-shot is like showing them 3 sorted items first. The examples dramatically reduce ambiguity.

Prompt (instructions only)

Classify the customer message as: billing, technical, or general.

Message: "My card was charged twice for the same order."
Label:

When Zero-Shot Suffices

Zero-shot classification works when labels are self-explanatory (like "positive" / "negative"), when the model's pretraining data covers the domain well, and when the task is common enough that the model has seen similar tasks during training. It is fastest to set up and requires no example curation.

When Few-Shot Helps

Few-shot examples are valuable when:

Labels are organization-specific — "P1" vs "P2" escalation levels that the model has never seen.
Boundaries are subtle — When "billing" and "account" overlap, examples show where you draw the line.
Format matters — Examples demonstrate the exact output structure you expect.
Edge cases are common — Strategic examples can bias the model toward correct handling of tricky cases.

GPT-3 (Brown et al., 2020) popularized in-context learning, where performance can improve substantially with well-chosen examples. The quality and diversity of examples matter more than quantity: 3-5 carefully chosen examples often outperform 20 random ones.

Example Selection Strategy

Choose few-shot examples that cover each class, include at least one edge case, and represent the actual distribution of inputs. Avoid examples that are too easy or too similar to each other. See Topic 4: Taxonomy Design for how label definitions interact with example selection.

→ Zero-shot works for obvious labels; few-shot works for subtle ones. The examples act as a tiny on-the-fly training signal that reduces ambiguity and improves boundary precision.

Python — Zero-Shot vs Few-Shot Prompting

# === Zero-shot: instructions only ===
zero_shot_prompt = """Classify the customer message as: billing, technical, or general.

Message: "My card was charged twice for the same order."
Label:"""

# === Few-shot: instructions + examples ===
few_shot_prompt = """Classify customer messages. Examples:

"I can't log into my account" -> technical
"Please update my billing address" -> billing
"What are your business hours?" -> general

Message: "My card was charged twice for the same order."
Label:"""

# Few-shot often improves accuracy on ambiguous or organization-specific
# labels; the size of the gain is task-dependent, so measure it on your own data

# Pro tip: use diverse, representative examples
# that cover edge cases, not just easy examples
def select_examples(examples, n_per_class=2):
    """Select diverse examples covering each class."""
    selected = []
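    for label in sorted(set(e["label"] for e in examples)):  # sorted for a stable order
        class_examples = [e for e in examples if e["label"] == label]
        # Simplest choice: take the first n_per_class examples. For real diversity,
        # pick examples that are far apart (e.g., by embedding distance).
        selected.extend(class_examples[:n_per_class])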
    return selected

Follow-up Questions

Does the order of few-shot examples matter?

Yes. LLMs can be sensitive to example ordering, and recency bias can cause the model to favor the label of the last example it saw. Best practice is to vary the order across calls, or to make sure the final example does not consistently come from the same class. Some research suggests randomizing order improves robustness.
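One way to vary the order across calls is to shuffle the examples each time the prompt is built. The sketch below is illustrative only: `build_few_shot_prompt` is a hypothetical helper, and the prompt wording mirrors the customer-message example used earlier in this topic.

```python
import random

def build_few_shot_prompt(examples, message, labels, seed=None):
    """Build a few-shot prompt with the example order shuffled."""
    rng = random.Random(seed)  # Pass a seed only when you need reproducibility
    shuffled = list(examples)  # Copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    lines = [f"Classify the customer message as one of: {', '.join(labels)}.",
             "", "Examples:"]
    for ex in shuffled:
        lines.append(f'"{ex["text"]}" -> {ex["label"]}')
    lines += ["", f'Message: "{message}"', "Label:"]
    return "\n".join(lines)

examples = [
    {"text": "I can't log into my account", "label": "technical"},
    {"text": "Please update my billing address", "label": "billing"},
    {"text": "What are your business hours?", "label": "general"},
]
print(build_few_shot_prompt(examples, "My card was charged twice for the same order.",
                            ["billing", "technical", "general"]))
```

Shuffling makes individual calls less repeatable, so it pairs naturally with evaluating accuracy over several orderings rather than a single one.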

How many few-shot examples are enough?

For most classification tasks, 3-5 examples per class provide the best accuracy-to-cost ratio. More examples increase prompt length (and cost) without proportional accuracy gains. However, for very subtle distinctions, up to 10 examples per class can help. Always measure the marginal improvement of additional examples.

Can few-shot examples hurt performance?

Yes. Misleading or unrepresentative examples can bias the model toward incorrect patterns. Examples that are too similar to each other reduce diversity. And mislabeled examples in the prompt directly teach the model wrong associations. Quality control of few-shot examples is critical.

Taxonomy Design

A label taxonomy should be mutually understandable, operationally useful, and as non-overlapping as possible. Labels need clear boundaries, inclusion rules, exclusion rules, and examples. If humans cannot classify consistently, the LLM will not fix the ontology for you.

💡 Taxonomy design is product design, not modeling. A confused label system produces a confused classifier — no amount of model power fixes ambiguous definitions.

Principles of Good Taxonomy

Mutual exclusivity — Each input should clearly belong to one class. If "billing complaint" and "service complaint" frequently overlap, merge or restructure them.
Exhaustive coverage — Every expected input should map to at least one label. Include an "other" category for genuinely novel cases.
Operational utility — Labels should map to different actions. If two labels trigger the same workflow, they should probably be merged.
Clear definitions — Each label needs a description, inclusion criteria, exclusion criteria, and 2-3 examples.

The Human Agreement Test

Before building any classifier, have 3-5 humans independently label a sample of 100+ items. Measure inter-annotator agreement with a chance-corrected statistic: Cohen's kappa for two annotators, or Fleiss' kappa or Krippendorff's alpha for three or more. If humans agree on fewer than 80% of items, the taxonomy is the problem, not the model. Fix the definitions before investing in model training.
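Cohen's kappa is defined for exactly two annotators; with three or more, Fleiss' kappa is a common alternative. The following is a minimal pure-Python sketch of Fleiss' kappa, assuming every item was labeled by the same number of annotators. The function name and the toy ratings are illustrative, not taken from a particular library.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` is a list of items, each a list of labels,
    one label per annotator, with the same number of annotators per item."""
    if not ratings:
        raise ValueError("Need at least one item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("Every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of annotator pairs that agree on this item
        agree = (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
        per_item_agreement.append(agree)

    observed = sum(per_item_agreement) / n_items
    # Agreement expected by chance from the overall label proportions
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # Kappa is undefined here; treated as full agreement in this sketch
    return (observed - expected) / (1 - expected)

ratings = [
    ["billing", "billing", "billing"],
    ["technical", "technical", "billing"],
    ["general", "general", "general"],
    ["billing", "technical", "technical"],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```

Kappa is a chance-corrected score, not a percentage of items agreed on, so it should be read against its own scale rather than compared directly with a raw agreement rate.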

Taxonomy Evolution

Taxonomies evolve. A well-designed system anticipates this by:

Strategy	Benefit
Version your taxonomy	Track which labels existed at which time
Log raw predictions + inputs	Enables re-labeling when taxonomy changes
Include "other/unknown"	Catches inputs that don't fit current labels
Monitor confusion pairs	Identifies labels that need clearer boundaries

This connects directly to Topic 5: Class Imbalance — poorly designed taxonomies often create artificial imbalance by merging common cases into one large class.

→ Many classification failures come from unclear class definitions, not weak models. Treat taxonomy design as product design: define boundaries, test with humans, and version your label system.

Python — Taxonomy Definition & Validation

# A well-structured taxonomy definition for an LLM classifier
taxonomy = {
    "billing": {
        "description": "Issues related to charges, invoices, refunds, or payment methods",
        "includes": ["double charges", "refund requests", "invoice errors"],
        "excludes": ["account access issues", "feature requests"],
        "examples": [
            "I was charged twice for my subscription",
            "Can I get a refund for last month?",
        ],
    },
    "technical": {
        "description": "Issues with product functionality, bugs, or access",
        "includes": ["login failures", "app crashes", "feature not working"],
        "excludes": ["billing questions", "general inquiries"],
        "examples": [
            "The app crashes when I try to upload a file",
            "I can't reset my password",
        ],
    },
}

# Validate taxonomy quality: check for overlapping keywords
def check_overlap(taxonomy):
    for l1, d1 in taxonomy.items():
        for l2, d2 in taxonomy.items():
            if l1 >= l2: continue
            overlap = set(d1["includes"]) & set(d2["includes"])
            if overlap:
                print(f"WARNING: {l1} and {l2} overlap on: {overlap}")

Follow-up Questions

How many classes can an LLM reliably handle?

LLMs can handle dozens of classes if the labels are well-defined and distinct. Performance degrades when labels are semantically similar or when the prompt becomes too long with definitions. For very large taxonomies (100+ classes), consider a hierarchical approach: classify into broad categories first, then sub-classify within each category.

Should the taxonomy be flat or hierarchical?

Hierarchical taxonomies are better when you have many fine-grained categories that naturally group into broader themes. They reduce cognitive load on the model and allow graceful degradation — if the model gets the broad category right but the subcategory wrong, you still have useful information. Flat taxonomies are simpler but become unwieldy above ~20 classes.
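A two-stage hierarchical classifier can be sketched as below. The hierarchy, the function names, and the keyword-based stand-in for the model are all hypothetical; `classify_fn(text, labels)` represents any single-label classifier, prompted or fine-tuned.

```python
def classify_hierarchical(text, hierarchy, classify_fn):
    """Two-stage classification: broad category first, then subcategory."""
    broad = classify_fn(text, list(hierarchy))
    if broad not in hierarchy:
        return {"category": None, "subcategory": None}
    subcategories = hierarchy[broad]
    if not subcategories:
        return {"category": broad, "subcategory": None}
    sub = classify_fn(text, subcategories)
    if sub not in subcategories:
        sub = None  # Graceful degradation: keep the broad label anyway
    return {"category": broad, "subcategory": sub}

# Hypothetical taxonomy and a keyword stand-in for the model
hierarchy = {
    "billing": ["refund", "double_charge"],
    "technical": ["login", "crash"],
}

def keyword_classifier(text, labels):
    lowered = text.lower()
    for label in labels:
        if label.replace("_", " ") in lowered:
            return label
    return "unknown"

print(classify_hierarchical("Billing question: I want a refund", hierarchy, keyword_classifier))
# {'category': 'billing', 'subcategory': 'refund'}
```

Each stage sees only a short label list, which keeps prompts small, at the price of two model calls per input.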

What happens when the taxonomy changes in production?

Taxonomy changes can silently break classifiers. Version your taxonomy, log raw inputs alongside predictions, and re-evaluate the model whenever labels change. For prompted classifiers, update the prompt immediately. For fine-tuned models, you need to retrain or at minimum re-evaluate and patch the most affected classes.

Class Imbalance

Class imbalance means some categories appear far more often than others. Prompt-only systems may overpredict majority classes unless the prompt explicitly describes minority cases. Address imbalance through better examples, targeted evaluation, cost-sensitive policies, or reweighted training data.

💡 Imbalance is like a search-and-rescue team that mostly finds lost hikers, not lost climbers. If you only measure "found people," you will think the team is great — even if they miss every climber.

Warning: "Fraud" has only 2% of samples. A naive model can reach 98% accuracy by never predicting fraud.

Why Imbalance Matters

In imbalanced datasets, a model can achieve high overall accuracy by simply predicting the majority class for every input. If 98% of transactions are legitimate, a model that always says "not fraud" is 98% accurate — but completely useless for its actual purpose.

For LLM-based classifiers, imbalance manifests differently. The model is not trained on your specific distribution, but its predictions can still be skewed by the number of examples per class in the prompt or by the base rates in its pretraining data.

Mitigation Strategies

Strategy	Approach	When to Use
Prompt engineering	Explicitly describe minority classes with more detail and examples	Prompt-based systems
Balanced evaluation	Use per-class metrics (precision, recall, F1) rather than overall accuracy	Always
Cost-sensitive review	Route minority-class predictions to human review more aggressively	High-stakes domains
Reweighted training	Oversample minority classes or weight their loss higher during fine-tuning	Fine-tuned models
Threshold tuning	Adjust classification thresholds per class based on business cost	Production deployment

Imbalance is both a data problem and a decision-policy problem. You may care more about minority recall in fraud, safety, or medical triage than about raw overall accuracy. The right evaluation metric should reflect that priority. See Topic 7: Metrics That Matter.
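To make the decision-policy view concrete, here is a small sketch of cost-based thresholding. It assumes you have a reasonably calibrated probability for the minority class and can put rough costs on each error type; the function names and the cost figures are hypothetical. Flagging has lower expected cost than not flagging when p * C_fn >= (1 - p) * C_fp, which rearranges to p >= C_fp / (C_fp + C_fn).

```python
def cost_based_threshold(cost_false_positive, cost_false_negative):
    """Probability above which flagging has lower expected cost than not flagging."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def should_flag(p_positive, cost_false_positive, cost_false_negative):
    """Flag when the expected cost of a miss outweighs that of a false alarm."""
    return p_positive >= cost_based_threshold(cost_false_positive, cost_false_negative)

# Hypothetical costs: a manual review costs 5, a missed fraud case costs 500
threshold = cost_based_threshold(cost_false_positive=5, cost_false_negative=500)
print(f"Flag when P(fraud) >= {threshold:.4f}")
print(should_flag(0.02, 5, 500))   # True: even a 2% chance is worth a review
print(should_flag(0.005, 5, 500))  # False: below the break-even point
```

With asymmetric costs the break-even threshold sits far below 0.5, which is why a default "pick the most likely class" rule tends to under-serve rare, expensive classes.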

→ Imbalance is not just a data problem — it is a decision-policy problem. Design your evaluation metrics, prompt strategies, and review policies around the cost of being wrong for each class.

Python — Handling Class Imbalance

from sklearn.metrics import classification_report
import numpy as np

# Simulated imbalanced predictions
y_true = ["normal"]*980 + ["fraud"]*20
y_pred = ["normal"]*990 + ["fraud"]*10  # Misses 50% of fraud

# Overall accuracy looks great but is misleading
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Overall accuracy: {accuracy:.1%}")  # 99.0% - deceptive!

# Per-class metrics reveal the real problem
print(classification_report(y_true, y_pred))
# fraud: recall = 0.50 (missing half of all fraud cases!)

# === For prompted LLMs: emphasize minority class ===
imbalance_aware_prompt = """Classify transactions. Pay special attention to
fraud indicators - it is critical not to miss fraud cases.

Fraud signals: unusual amounts, new locations, rapid succession,
mismatched billing details.

IMPORTANT: When uncertain between fraud and normal, flag as fraud
for human review. False negatives are much costlier than false positives."""

Follow-up Questions

Does few-shot example balance affect LLM predictions?

Yes. If your few-shot prompt has 5 examples of class A and 1 of class B, the model may develop a prior bias toward A. For balanced classification, provide roughly equal numbers of examples per class, or explicitly tell the model that classes are equally likely. This is different from real-world base rates, which can be encoded separately through instructions.

How do you evaluate on imbalanced data?

Use stratified sampling for your evaluation set to ensure all classes are represented. Report per-class precision and recall, macro-F1 (which weights all classes equally), and a confusion matrix. Never rely on accuracy alone. For high-stakes minority classes, track recall specifically because missing a fraud case or safety violation is costlier than a false alarm.

Can synthetic data help with minority classes?

Yes. LLMs can generate synthetic examples for minority classes to augment training data or few-shot prompts. However, the synthetic examples must be validated against real data distributions to avoid introducing artifacts. The best approach is to generate candidates with an LLM and then have domain experts filter and correct them.

Production & Reliability

Multi-label challenges, evaluation metrics, confidence estimation, human review, and the failure modes that break classification systems in production.

Multi-Label Classification

In single-label classification, exactly one class is chosen. In multi-label, multiple labels may apply simultaneously. The prompt, schema, and evaluation strategy must all change — multi-label is not just a small extension of single-label; it changes the entire decision structure.

💡 Single-label is like choosing one genre for a movie. Multi-label is like tagging a movie with all genres that apply: it can be both "comedy" and "romance" at the same time.

Input Text

"I was charged twice and the app keeps crashing when I try to view my invoice."

Applicable labels: billing, technical

Two labels apply at once, which multi-label classification allows.

How Multi-Label Differs

The structural differences between single-label and multi-label classification are significant:

Aspect	Single-Label	Multi-Label
Decision rule	Pick the highest-scoring class	Each class has an independent threshold
Output format	One label string	List/set of label strings
Evaluation	Accuracy, macro-F1	Subset accuracy, Hamming loss, per-label F1
Common error	Picking wrong class	Under-tagging or over-tagging
Prompt design	"Choose ONE label"	"List ALL applicable labels"

Challenges with LLMs

LLMs tend to either under-tag (missing applicable labels) or over-tag (applying labels that are only marginally relevant). Specific challenges include:

Calibration — How confident must the model be to include a label? Each label needs its own inclusion threshold.
Order effects — The model may apply the first few relevant labels and stop, missing later ones.
Validation — Checking that a list of labels is valid is harder than checking a single label.

For related guidance on evaluation metrics for multi-label settings, see Topic 7: Metrics That Matter.

→ Multi-label classification changes the decision structure, not just the output format. Each label needs its own inclusion threshold, and evaluation must account for both under-tagging and over-tagging.

Python — Multi-Label LLM Classification

import json

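def multi_label_classify(text, labels, generate):
    """Classify text with ALL applicable labels (multi-label).

    `generate` is any callable (hypothetical here) that takes a prompt
    string and returns the model's raw text response.
    """
    prompt = f"""Analyze the following text and select ALL labels that apply.
Only include a label if it is clearly relevant.

Available labels: {json.dumps(labels)}

Text: "{text}"

Respond with JSON: {{"labels": ["label1", "label2", ...], "reasons": {{"label1": "...", ...}}}}"""

    result = json.loads(generate(prompt))
    # Keep only labels from the allowed set; drop anything the model invented
    result["labels"] = [l for l in result.get("labels", []) if l in labels]
    return result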

# === Multi-label evaluation ===
from sklearn.metrics import hamming_loss, f1_score
import numpy as np

# Binary vectors: [billing, technical, account, general]
y_true = np.array([[1,1,0,0], [0,0,1,1], [1,0,0,0]])
y_pred = np.array([[1,1,0,0], [0,1,1,0], [1,0,0,0]])

print(f"Hamming Loss: {hamming_loss(y_true, y_pred):.3f}")
print(f"Micro F1:     {f1_score(y_true, y_pred, average='micro'):.3f}")
print(f"Macro F1:     {f1_score(y_true, y_pred, average='macro'):.3f}")

Follow-up Questions

How do you set thresholds for each label independently?

For each label, use a calibration set to find the threshold that maximizes F1 or another target metric for that specific class. This means each label can have a different inclusion threshold — a common pattern is higher thresholds for labels with high false-positive costs and lower thresholds for labels where recall is critical.
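A per-label threshold search can be sketched in a few lines. This assumes each label has a numeric score per item (for example from logprobs or self-consistency) plus ground-truth 0/1 values on a calibration set; the function names and the toy data are illustrative.

```python
def f1_at_threshold(scores, truths, threshold):
    """F1 for one label when predicting positive at score >= threshold."""
    tp = sum(1 for s, t in zip(scores, truths) if s >= threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, truths) if s >= threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, truths) if s < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, truths):
    """Try each observed score as a threshold; keep the one with the highest F1."""
    candidates = sorted(set(scores))
    # max() returns the first maximum, so ties go to the lowest threshold
    return max(candidates, key=lambda th: f1_at_threshold(scores, truths, th))

def tune_thresholds(calibration):
    """`calibration` maps label -> (scores, truths) from a held-out set."""
    return {label: best_threshold(s, t) for label, (s, t) in calibration.items()}

# Toy calibration data (made up)
calibration = {
    "billing":   ([0.9, 0.8, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0]),
    "technical": ([0.7, 0.6, 0.2],           [1, 0, 0]),
}
print(tune_thresholds(calibration))
# {'billing': 0.3, 'technical': 0.7}
```

Swapping `f1_at_threshold` for a recall-at-minimum-precision objective, or a cost-weighted one, gives the stricter or looser thresholds described above without changing the search itself.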

Should you use separate prompts for each label or one combined prompt?

A combined prompt is more efficient (one API call) but may suffer from label interference. Separate prompts (one per label) give more independent predictions but cost more. The hybrid approach is to use a combined prompt for initial labeling and separate prompts for labels where the combined approach shows high confusion.

What is Hamming loss and why use it for multi-label?

Hamming loss measures the fraction of labels that are incorrectly predicted (either false positives or false negatives). It is a natural metric for multi-label because it penalizes each wrong label independently, unlike subset accuracy which requires the entire label set to match exactly. A Hamming loss of 0.05 means 5% of individual label decisions are wrong.
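The difference between Hamming loss and subset accuracy is easy to see by computing both by hand. The sketch below uses the same binary label vectors as the evaluation example in this topic; the helper names are illustrative.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of items whose entire label set matches exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)

# Columns: [billing, technical, account, general]
y_true = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]]
y_pred = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

print(f"Hamming loss:    {hamming_loss_manual(y_true, y_pred):.3f}")  # 0.167
print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.3f}")      # 0.667
```

The second item has two wrong label decisions out of four, so it costs 2 of 12 decisions under Hamming loss but counts as a complete miss under subset accuracy.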

Metrics That Matter

Accuracy is a starting point, but precision, recall, F1, confusion matrices, and calibration are usually more informative. The best metric is the one aligned to the cost of being wrong — if false negatives are expensive, optimize recall; if false positives trigger painful review, optimize precision.

💡 Metrics are like medical tests: accuracy tells you how often the test is right, but it does not tell you whether it is catching the deadly conditions (recall) or sending healthy people for unnecessary surgery (precision).


Precision

Of all items the model labeled positive, how many actually were? Optimize when false positives are costly.

TP / (TP + FP)

Recall

Of all items that actually were positive, how many did the model catch? Optimize when false negatives are dangerous.

TP / (TP + FN)

F1 Score

Harmonic mean of precision and recall. Useful when you need a single number that balances both concerns.

2 * (P * R) / (P + R)

Macro-F1

Average F1 across all classes, treating each class equally regardless of size. Critical for imbalanced datasets.

mean(F1_class_i for all i)
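As a worked example of the formulas above, the sketch below computes precision, recall, and F1 directly from counts. The helper name and the scenario (20 true fraud cases, 10 caught, no false alarms) are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 20 true fraud cases: 10 caught (TP), 10 missed (FN), no false alarms (FP)
p, r, f1 = precision_recall_f1(tp=10, fp=0, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.50 f1=0.67
```

Perfect precision hides the fact that half the fraud cases were missed; the harmonic mean in F1 pulls the combined score toward the weaker of the two numbers.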

Choosing the Right Metric

The metric you optimize should reflect the business cost of being wrong:

Scenario	Optimize	Why
Fraud detection	Recall	Missing fraud is catastrophic
Content moderation	Precision	Over-censoring drives users away
Medical triage	Recall (sensitivity)	Missing a critical case is unacceptable
Lead qualification	Precision	Sales time wasted on bad leads is expensive
General classification	Macro-F1	Balanced performance across all classes

Beyond Standard Metrics

Senior candidates distinguish themselves by mentioning operational metrics:

Abstention rate — How often does the system say "I don't know" and escalate?
Reviewer overturn rate — How often do humans disagree with the model's label?
Calibration — When the model says 90% confident, is it right 90% of the time? See Topic 8: Confidence Estimation.
Confusion matrix patterns — Which specific class pairs are most frequently confused?

These operational metrics connect to Topic 9: Human in the Loop and Topic 10: Production Failure Modes.

→ The best metric is the one aligned to the cost of being wrong. Connect your evaluation to business risk: optimize recall when false negatives are expensive, precision when false positives are costly.

Python — Classification Metrics

from sklearn.metrics import (
    classification_report, confusion_matrix, f1_score
)
import numpy as np

# Ground truth and model predictions
y_true = ["billing"]*30 + ["technical"]*50 + ["general"]*20
y_pred = ["billing"]*25 + ["technical"]*5 + \
         ["technical"]*45 + ["billing"]*5 + \
         ["general"]*18 + ["technical"]*2

# Full classification report with per-class metrics
print(classification_report(y_true, y_pred))

# Macro-F1 treats all classes equally (important for imbalanced data)
macro = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1: {macro:.3f}")

# Confusion matrix reveals which classes are confused
cm = confusion_matrix(y_true, y_pred, labels=["billing","technical","general"])
print("Confusion Matrix:")
print(cm)
# Rows = true, Cols = predicted
# Off-diagonal values show misclassification patterns

Follow-up Questions

When should you use micro-F1 vs macro-F1?

Micro-F1 aggregates TP/FP/FN across all classes before computing F1, so larger classes dominate the score. Macro-F1 computes F1 per class and then averages, giving equal weight to all classes. Use macro-F1 when you care about performance on minority classes. Use micro-F1 when overall prediction accuracy matters most; in single-label multi-class problems, micro-F1 is equal to accuracy.
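The two averages can be computed side by side from per-class counts. In this sketch the counts correspond to the billing/technical/general predictions from the metrics example in this topic, and the helper name is illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Per-class (TP, FP, FN)
counts = {
    "billing":   (25, 5, 5),
    "technical": (45, 7, 5),
    "general":   (18, 0, 2),
}

# Macro: compute F1 per class, then take the unweighted mean
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts across classes, then compute a single F1
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"macro={macro_f1:.3f} micro={micro_f1:.3f}")
# macro=0.888 micro=0.880
```

Here the micro score matches the overall accuracy (88 of 100 correct), while the macro score gives the smallest class, "general", the same weight as the largest.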

How do you interpret a confusion matrix for LLM classifiers?

Look for clusters of off-diagonal values that indicate systematic confusion between specific class pairs. These patterns tell you where the taxonomy needs clearer boundaries (see Topic 4: Taxonomy Design), where few-shot examples should be added, or where the prompt needs more specific guidance about distinguishing those classes.

What is calibration and why does it matter for classification?

Calibration means the model's confidence scores match real-world accuracy. A well-calibrated model that says "90% confident" should be correct about 90% of the time. This matters for threshold-based routing: if you route low-confidence cases to humans, poor calibration means you are either escalating too much (wasting reviewer time) or too little (missing errors).
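One common way to quantify calibration is expected calibration error (ECE): group predictions into confidence bins and compare each bin's average confidence with its actual accuracy. The following is a minimal sketch with equal-width bins; the function name and toy data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Toy data: the model claims 95% confidence but is right only 6 times out of 10
confidences = [0.95] * 10
correct = [True] * 6 + [False] * 4
print(f"ECE: {expected_calibration_error(confidences, correct):.2f}")
# ECE: 0.35
```

An ECE near zero means stated confidence tracks observed accuracy, which is what threshold-based routing relies on.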

Confidence Estimation

Confidence can be estimated through constrained label probabilities, self-consistency checks, calibration sets, or agreement across prompt variants. Raw verbal confidence statements from the model are not reliable enough on their own — confidence should be measured externally whenever possible.

💡 An LLM saying "I'm 95% confident" is like a student saying "I'm sure I aced the test." You need an external measurement (the grade) to know if their self-assessment is accurate.

Confidence estimation methods, ranked by reliability.

Token Log Probabilities
Use the model's actual probability for the label token. Requires API support.
Reliability: High

Calibration Set
Map model scores to real probabilities using held-out labeled data.
Reliability: High

Self-Consistency
Run the same input N times with temperature > 0. Agreement rate = confidence.
Reliability: Good

Verbal Confidence
Ask the model "how confident are you?" Cheapest but least reliable.
Reliability: Low

Why Verbal Confidence Fails

When asked "how confident are you?", LLMs tend to be overconfident. They produce high verbal confidence even for wrong answers, because they are trained to generate plausible-sounding text, not calibrated probabilities. A model might say "95% confident" for an answer it gets wrong 40% of the time.

Better Approaches

Method	How It Works	Trade-Off
Token logprobs	Extract the log probability the model assigns to the label token	Best signal, but not all APIs expose it
Self-consistency	Sample N responses and measure label agreement	Good signal, but costs N API calls per prediction
Prompt variants	Rephrase the prompt 3-5 ways and check agreement	Catches prompt sensitivity, moderately expensive
Calibration mapping	Use a held-out labeled set to build a mapping from raw scores to true probabilities	Excellent accuracy, requires labeled calibration data
Ensemble agreement	Run multiple models and check agreement	Most robust, most expensive
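
To illustrate the token logprobs row: if an API returns a log probability for the first token of each candidate label, those values can be renormalized into a distribution over the label set. This is a sketch under that assumption. The input dictionary is a hypothetical format, and labels that span several tokens need more careful handling than shown here.

```python
import math

def label_probabilities(label_logprobs):
    """Normalize per-label log probabilities into a distribution over labels."""
    # Subtract the max before exponentiating for numerical stability.
    max_lp = max(label_logprobs.values())
    exps = {label: math.exp(lp - max_lp) for label, lp in label_logprobs.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Hypothetical logprobs for the first token of each label.
logprobs = {
    "billing": math.log(0.60),
    "technical": math.log(0.30),
    "general": math.log(0.05),
}
probs = label_probabilities(logprobs)
best_label = max(probs, key=probs.get)
print(best_label, round(probs[best_label], 3))
```

Renormalizing discards probability mass the model placed on tokens outside the label set, so the result is a relative score among labels rather than an absolute probability.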

Production Pattern

Production systems typically combine multiple signals: model scores, retrieval evidence, schema validity, and historical error patterns. The combined signal feeds a routing decision: high-confidence cases are auto-routed, low-confidence cases go to human review. See Topic 9: Human in the Loop.

→ Never trust verbal confidence alone. Measure confidence externally using logprobs, self-consistency, or calibration sets. Production systems combine multiple signals to decide when to auto-route vs escalate.

Python — Self-Consistency Confidence

from collections import Counter

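def self_consistency_classify(text, labels, classify_fn, n_samples=5):
    """Run classification N times and use agreement as confidence.

    classify_fn(text, labels) is supplied by the caller and must return one
    label per call, sampled with temperature > 0 so that outputs can vary,
    e.g. lambda t, ls: llm.classify(t, ls, temperature=0.7).
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    predictions = [classify_fn(text, labels) for _ in range(n_samples)]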

    # Count label frequencies across samples
    counts = Counter(predictions)
    top_label, top_count = counts.most_common(1)[0]

    # Agreement rate = confidence estimate
    confidence = top_count / n_samples

    return {
        "label": top_label,
        "confidence": confidence,
        "distribution": dict(counts),
        "should_escalate": confidence < 0.6,  # Low agreement = uncertain
    }

# Example: 5 runs produce ["billing","billing","billing","technical","billing"]
# → label="billing", confidence=0.8, should_escalate=False

# Example: 5 runs produce ["billing","technical","billing","general","technical"]
# → label="billing", confidence=0.4, should_escalate=True

Follow-up Questions

How do you calibrate an LLM classifier?

Collect a calibration dataset with known labels. Run the classifier and record its confidence scores. Then fit a calibration function (e.g., Platt scaling or isotonic regression) that maps raw scores to true probabilities. After calibration, a score of 0.9 should mean the model is correct about 90% of the time. Re-calibrate whenever the model or taxonomy changes.
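
As a rough illustration of the isotonic option, the sketch below fits a monotone step function with the pool-adjacent-violators procedure in plain Python. The function names are illustrative, and the step-function lookup (no interpolation) is a simplifying assumption. In practice a library implementation such as scikit-learn's `IsotonicRegression` is the more likely choice.

```python
from bisect import bisect_left

def fit_isotonic(scores, correct):
    """Pool-adjacent-violators fit. Returns (upper_bounds, calibrated_values)."""
    # Aggregate identical scores so each score maps to exactly one block.
    grouped = {}
    for s, ok in zip(scores, correct):
        total, n = grouped.get(s, (0.0, 0))
        grouped[s] = (total + float(ok), n + 1)
    blocks = []  # each block: [sum_of_outcomes, count, max_score_in_block]
    for score in sorted(grouped):
        total, n = grouped[score]
        blocks.append([total, n, score])
        # Merge backwards while accuracy would decrease as score increases.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s_sum, s_n, s_hi = blocks.pop()
            blocks[-1][0] += s_sum
            blocks[-1][1] += s_n
            blocks[-1][2] = s_hi
    bounds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return bounds, values

def calibrate(score, bounds, values):
    """Map a raw score to the observed accuracy of the block it falls in."""
    idx = min(bisect_left(bounds, score), len(values) - 1)
    return values[idx]

# Hypothetical calibration set: raw scores and whether the prediction was right.
bounds, values = fit_isotonic(
    [0.55, 0.6, 0.7, 0.8, 0.9, 0.95],
    [False, True, False, True, True, True],
)
print(calibrate(0.65, bounds, values))
```

With only six points the mapping is coarse. A real calibration set needs enough examples per score range for the block accuracies to be stable.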

Is temperature 0 always best for classification?

Temperature 0 gives the most deterministic output, which is usually best for single-pass classification. However, for self-consistency confidence estimation, you need temperature > 0 to generate variation. The typical pattern is temperature=0 for the production classification call and temperature=0.5-0.7 for confidence estimation samples.

What threshold should trigger human review?

The threshold depends on the cost of errors vs the cost of human review. Start with a conservative threshold (e.g., escalate everything below 80% confidence) and tune based on reviewer overturn rates. If reviewers rarely change the model's answer, lower the threshold so that fewer cases escalate. If they frequently overturn it, raise the threshold so that more cases are reviewed. The optimal threshold is where marginal review cost equals marginal error cost.
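
The cost trade-off can be made concrete on a labeled calibration set. The sketch below tries a grid of thresholds and keeps the cheapest one. The cost figures and function name are hypothetical, and it assumes a human review always produces the correct label.

```python
def best_threshold(confidences, correct, review_cost, error_cost, candidates=None):
    """Pick the escalation threshold with the lowest total cost on labeled data."""
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    best = None
    for t in candidates:
        cost = 0.0
        for conf, ok in zip(confidences, correct):
            if conf < t:
                cost += review_cost   # escalated: pay for a human review
            elif not ok:
                cost += error_cost    # auto-routed and wrong: pay for the error
        # Strict comparison keeps the lowest threshold among equally cheap ones.
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Hypothetical data: a review costs 1 unit, an uncaught error costs 10.
confidences = [0.95, 0.9, 0.85, 0.7, 0.6, 0.55]
correct = [True, True, True, False, True, False]
threshold, cost = best_threshold(confidences, correct, review_cost=1, error_cost=10)
print(threshold, cost)
```

When errors are far more expensive than reviews, the chosen threshold moves up. When reviews are expensive, it moves down.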

Human in the Loop

Human review is appropriate when decisions are high-impact, ambiguous, novel, or compliance-sensitive. It is also valuable when the model has low confidence or conflicting evidence. A mature design routes easy cases automatically and reserves scarce reviewer attention for cases where it creates the most risk reduction.

💡 Human-in-the-loop is like a hospital triage system: the nurse handles routine cases, but the doctor sees anything flagged as uncertain or high-risk. The goal is not to replace the nurse — it is to focus the doctor's time where it matters most.

A typical classification pipeline with human escalation routing.

LLM classifies input and produces label + confidence

↓

Route decision: confidence > threshold?

↓ Yes (high confidence): auto-route to action (~80% of volume)

↓ No (low confidence): human review queue (~20% of volume)

↓ Feedback loop

Human decisions become training/evaluation data for model improvement

When to Escalate

Route to human review when:

Low model confidence — The classifier is uncertain (see Topic 8: Confidence Estimation).
High-impact decision — Account suspension, medical triage, legal compliance.
Novel input — The input looks different from what the model was trained or prompted on.
Conflicting signals — Multiple models or prompt variants disagree.
Frequently confused classes — The specific class pair has a high historical confusion rate.
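
These triggers can be checked together so that each escalated case carries the reasons it was escalated. The sketch below is one possible arrangement. The label names, thresholds, and the `is_novel` flag (which would come from a separate novelty detector) are all assumptions for illustration.

```python
# Hypothetical set of labels whose consequences justify mandatory review.
HIGH_IMPACT_LABELS = {"account_suspension", "medical_triage", "legal_compliance"}

def escalation_reasons(label, confidence, variant_labels, pair_confusion_rate,
                       is_novel=False, min_confidence=0.8, max_confusion=0.2):
    """Return the list of reasons a prediction should go to human review."""
    reasons = []
    if confidence < min_confidence:
        reasons.append("low_confidence")
    if label in HIGH_IMPACT_LABELS:
        reasons.append("high_impact")
    if is_novel:
        reasons.append("novel_input")
    # variant_labels: labels from other models or prompt variants on the same input.
    if len(set(variant_labels)) > 1:
        reasons.append("conflicting_signals")
    # pair_confusion_rate: historical error rate for this label's most confused pair.
    if pair_confusion_rate > max_confusion:
        reasons.append("frequently_confused")
    return reasons

print(escalation_reasons("billing", 0.95, ["billing", "billing"], 0.05))
print(escalation_reasons("account_suspension", 0.6,
                         ["account_suspension", "billing"], 0.3, is_novel=True))
```

An empty list means the case can be auto-routed. Recording the reasons also makes it possible to see which trigger drives most of the review volume.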

Human Review as a Data Engine

The most valuable aspect of human review is not the individual decision — it is the data it generates. Human decisions on escalated cases are the highest-quality training and evaluation data available because they represent the hardest, most informative examples. Feed this data back into:

Evaluation sets — Hard cases make evaluation more realistic.
Few-shot examples — Resolved edge cases become great prompt examples.
Fine-tuning data — Labeled hard cases improve model weak spots.
Taxonomy refinement — Patterns in escalations reveal where labels need clearer boundaries (see Topic 4: Taxonomy Design).

→ Human review is a precision tool, not a sign of system weakness. A mature design routes easy cases automatically and treats reviewer decisions as the most valuable data source for model improvement.

Python — Confidence-Based Routing

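def classify_and_route(text, classify_with_confidence, confidence_threshold=0.8):
    """Classify with automatic routing based on confidence.

    classify_with_confidence(text) is supplied by the caller and returns a
    dict with "label", "confidence" (0 to 1), and optionally "reason".
    """
    # Step 1: Get classification with confidence
    result = classify_with_confidence(text)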

    # Step 2: Route based on confidence
    if result["confidence"] >= confidence_threshold:
        # High confidence: auto-route
        return {
            "action": "auto_route",
            "label": result["label"],
            "confidence": result["confidence"],
        }
    else:
        # Low confidence: escalate to human reviewer
        return {
            "action": "human_review",
            "suggested_label": result["label"],
            "confidence": result["confidence"],
            "reason": result.get("reason", "Low confidence"),
            "priority": "high" if result["confidence"] < 0.5 else "normal",
        }

# Step 3: Log human decisions for model improvement
def log_review_decision(text, model_label, human_label):
    """Store human review outcomes for retraining."""
    record = {
        "text": text,
        "model_label": model_label,
        "human_label": human_label,
        "overturned": model_label != human_label,
    }
    # Append to training/evaluation dataset
    # These hard cases are the most valuable training data
    return record

Follow-up Questions

How do you size the human review team?

Size depends on the escalation rate and SLA requirements. If 20% of 10,000 daily predictions escalate and each review takes 2 minutes, you need ~67 reviewer-hours per day. Monitor the escalation rate as the model improves — it should decrease over time as hard cases feed back into training. Start with more reviewers than you think you need and scale down.
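
The arithmetic behind that estimate is simple enough to keep as a small calculator. In the sketch below, the six productive hours per reviewer per day is an assumption added for illustration. The other figures come from the example above.

```python
import math

def reviewer_hours_per_day(daily_predictions, escalation_rate, minutes_per_review):
    """Total human review time needed per day, in hours."""
    return daily_predictions * escalation_rate * minutes_per_review / 60

def reviewers_needed(hours_per_day, productive_hours_per_reviewer=6):
    """Headcount needed, rounding up to whole reviewers."""
    return math.ceil(hours_per_day / productive_hours_per_reviewer)

hours = reviewer_hours_per_day(10_000, 0.20, 2)  # 2,000 reviews x 2 min = 4,000 min
print(round(hours, 1), reviewers_needed(hours))
```

Rerunning the calculation with a lower escalation rate shows how quickly staffing needs fall as the model improves.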

Should reviewers see the model's prediction?

It depends on your goal. Showing the model's prediction can anchor the reviewer, causing them to agree with the model more often than they should. For unbiased evaluation, hide the prediction. For production efficiency, show it — reviewers are faster when they can confirm or reject rather than classify from scratch. Consider hiding it for evaluation sets but showing it for production review.

How do you handle reviewer disagreements?

Use multi-reviewer consensus for high-stakes decisions (2-3 reviewers per case). Track inter-reviewer agreement rates. If reviewers frequently disagree on specific labels, that is a signal the taxonomy needs clearer boundaries (see Topic 4: Taxonomy Design). Adjudication processes should be documented and consistent.
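
Inter-reviewer agreement is often reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal two-reviewer sketch with hypothetical labels. For more than two reviewers a different statistic (such as Fleiss' kappa) would be needed.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both reviewers pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)

reviewer_1 = ["billing", "billing", "technical", "technical", "general", "billing"]
reviewer_2 = ["billing", "technical", "technical", "technical", "general", "billing"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

Computing kappa per label pair, rather than only overall, points to the specific boundaries reviewers disagree on.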

Production Failure Modes

Common production failures include label drift after taxonomy changes, prompt brittleness, hidden format errors, poor minority-class treatment, and false confidence on ambiguous inputs. Classification quality depends on prompts, data definitions, evaluation sets, routing policies, and review loops — not just model accuracy.

💡 A classification system is like a supply chain: the end product (correct labels) depends on every link — data quality, prompt stability, taxonomy clarity, evaluation rigor, and review processes. A break in any link degrades the whole system.

Label Drift

Taxonomy changes break the classifier silently.

Prompt Brittleness

Small prompt changes cause large accuracy swings.

Format Errors

Model outputs labels in unexpected formats.

Minority Class Neglect

Rare but important classes are systematically missed.

Upstream Data Shift

Changes in input preprocessing silently alter classifier behavior.

Mitigation (Label Drift)

Version your taxonomy. Log raw inputs alongside predictions. Re-evaluate the classifier whenever labels change. Set up alerts for prediction distribution shifts that indicate a taxonomy-model mismatch.

The System-Level View

The strongest interview answer treats classification quality as a system property, not a model property. Quality depends on the entire pipeline:

Data definitions — Ambiguous taxonomies produce ambiguous predictions (see Topic 4: Taxonomy Design).
Prompt stability — If minor prompt tweaks change 10% of predictions, the system is fragile.
Evaluation coverage — If your test set does not include edge cases and minority classes, you will not detect problems (see Topic 7: Metrics That Matter).
Routing policies — The confidence threshold determines the auto/escalation split (see Topic 9: Human in the Loop).
Monitoring — Tracking only aggregate accuracy will miss class-specific degradation.

Common Failure Catalog

Failure	Symptom	Detection
Label drift	Old labels appear in output after taxonomy update	Monitor label distribution over time
Prompt brittleness	Accuracy drops after minor prompt edit	A/B test prompt changes on evaluation set
Format errors	Downstream systems crash on unexpected output	Schema validation on every prediction
Minority neglect	Per-class recall near zero for rare classes	Per-class metrics, not just overall accuracy
Silent degradation	Accuracy drops without obvious cause	Automated regression testing on evaluation set
Upstream shift	Predictions change after an input preprocessing change, with no prompt or model change	Monitor input feature distributions
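
The schema validation check in the table can be as small as the sketch below, which parses the model's raw output and rejects anything outside the current label set. The label set, the JSON shape with a "label" key, and the error codes are assumptions for illustration.

```python
import json

ALLOWED_LABELS = {"billing", "technical", "general"}  # hypothetical taxonomy

def parse_prediction(raw_output, allowed_labels=ALLOWED_LABELS):
    """Return (label, error). Exactly one of the two is None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, "not_json"
    if not isinstance(data, dict) or "label" not in data:
        return None, "missing_label"
    label = data["label"]
    if not isinstance(label, str):
        return None, "label_not_string"
    # Tolerate case and whitespace differences, but nothing else.
    label = label.strip().lower()
    if label not in allowed_labels:
        return None, "unknown_label"
    return label, None

print(parse_prediction('{"label": "Billing "}'))
print(parse_prediction('{"label": "refunds"}'))
print(parse_prediction("billing"))
```

Counting the `unknown_label` errors over time also doubles as a label drift signal, since retired labels show up there after a taxonomy update.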

Monitoring Best Practices

Production classification systems should monitor:

Prediction distribution — Alert if the fraction of any class shifts by more than a threshold.
Confidence distribution — Alert if average confidence drops (may indicate out-of-distribution inputs).
Escalation rate — Alert if the fraction of human-reviewed cases spikes.
Per-class metrics — Track precision and recall per class, not just overall accuracy.

→ Classification quality is a system property. Monitor prompts, data definitions, evaluation sets, routing policies, and review loops. If you only watch a single accuracy number, you will miss the real reasons the system is succeeding or failing.

Python — Classification Monitoring Pipeline

from collections import Counter
from datetime import datetime, timezone

class ClassificationMonitor:
    """Monitor classification system health in production."""

    def __init__(self, expected_distribution, drift_threshold=0.1):
        self.expected = expected_distribution  # e.g. {"billing": 0.5, "technical": 0.3, ...}
        self.threshold = drift_threshold       # max allowed absolute shift per class
        self.predictions = []

    def log_prediction(self, label, confidence, text_hash):
        """Log each prediction for monitoring."""
        self.predictions.append({
            "label": label,
            "confidence": confidence,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc),
            "text_hash": text_hash,
        })

    def check_distribution_drift(self):
        """Alert if label distribution has shifted significantly."""
        if not self.predictions:
            return []  # nothing logged yet; avoids division by zero below
        counts = Counter(p["label"] for p in self.predictions)
        total = sum(counts.values())
        alerts = []
        for label, expected_frac in self.expected.items():
            actual_frac = counts.get(label, 0) / total
            drift = abs(actual_frac - expected_frac)
            if drift > self.threshold:
                alerts.append(f"DRIFT: {label} expected {expected_frac:.0%}, got {actual_frac:.0%}")
        return alerts

    def check_confidence_drop(self, min_avg=0.7):
        """Alert if average confidence drops below threshold."""
        if not self.predictions:
            return []
        avg_conf = sum(p["confidence"] for p in self.predictions) / len(self.predictions)
        if avg_conf < min_avg:
            return [f"LOW CONFIDENCE: avg={avg_conf:.2f} (threshold={min_avg})"]
        return []
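
The monitor above covers prediction and confidence distributions but not per-class metrics, which need ground-truth labels (for example from human review). A minimal sketch of that piece follows. The function name and sample data are illustrative.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and support for every label seen."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        metrics[label] = {
            # Report 0.0 when a class was never predicted or never present.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "support": tp + fn,
        }
    return metrics

# Hypothetical case: 80% overall accuracy hides a rare class that is never caught.
y_true = ["billing"] * 8 + ["fraud"] * 2
y_pred = ["billing"] * 10
for label, m in per_class_metrics(y_true, y_pred).items():
    print(label, m)
```

Here the aggregate accuracy looks acceptable while recall for the rare class is zero, which is the pattern the per-class bullet is meant to catch.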

Follow-up Questions

How do you test prompt changes safely?

Evaluate offline first, then shadow test. Never change a production prompt without first evaluating it against 200+ labeled examples covering all classes and known edge cases, and require the new prompt to match or exceed the old one on that set. Then run the new prompt alongside the old one on live traffic, compare outputs, and review the cases where the two disagree before switching.
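
A comparison on the labeled evaluation set might look like the sketch below. It reports accuracy for both prompts, the fraction of predictions that changed, and the examples the new prompt broke. The function name and sample data are hypothetical.

```python
def compare_prompts(gold, old_preds, new_preds):
    """Compare two prompts' predictions on the same labeled examples."""
    n = len(gold)
    rows = list(zip(gold, old_preds, new_preds))
    return {
        "old_accuracy": sum(o == g for g, o, _ in rows) / n,
        "new_accuracy": sum(p == g for g, _, p in rows) / n,
        # Fraction of examples where the two prompts disagree with each other.
        "flip_rate": sum(o != p for _, o, p in rows) / n,
        # Indices the old prompt got right and the new prompt gets wrong.
        "regressions": [i for i, (g, o, p) in enumerate(rows) if o == g and p != g],
    }

gold = ["billing", "technical", "billing", "general", "technical"]
old_preds = ["billing", "technical", "technical", "general", "billing"]
new_preds = ["billing", "technical", "billing", "technical", "billing"]
print(compare_prompts(gold, old_preds, new_preds))
```

In this example both prompts score the same accuracy, yet 40% of predictions changed and one previously correct case regressed, which is why accuracy alone is not enough to approve a prompt change.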

What causes silent degradation?

Common causes include model version updates by the API provider (behavior changes between model versions), input distribution shift (new customer segments produce different text), upstream preprocessing changes (a new text cleaning step alters what the classifier sees), and taxonomy drift (real-world categories evolve while labels stay static).

How often should you re-evaluate a production classifier?

Continuously, via automated regression tests on a held-out evaluation set. Run daily metrics checks on prediction distributions and confidence levels. Do a full manual evaluation with fresh labeled data monthly or after any significant change (prompt update, model version change, taxonomy modification). The cost of re-evaluation is tiny compared to the cost of undetected degradation.