How LLMs perform classification, when to prompt vs fine-tune, and how to design label systems that actually work.
Generative LLM as Classifier
How It Works
A generative LLM performs classification by being prompted with instructions that define the label set and expected output format. The model uses its language understanding to map the input text to one of the defined categories. This works especially well when:
- Classes are described in natural language — The model can understand what "urgent" or "billing_issue" means from its pretraining.
- Input is messy or unstructured — LLMs handle typos, slang, and mixed formats better than traditional classifiers.
- Examples are scarce — No labeled training data is needed for zero-shot classification.
The Trade-Off
Generative classification is powerful but comes with costs. Compared to a dedicated classifier:
| Factor | LLM Classifier | Dedicated Classifier |
|---|---|---|
| Setup time | Minutes (write a prompt) | Days to weeks (train a model) |
| Per-prediction cost | Higher (API call) | Lower (small model) |
| Latency | 100ms–2s | 1–10ms |
| Output stability | Can vary between calls | Deterministic |
| Explainability | Can generate rationale | Feature importance only |
For a detailed comparison of when to prompt versus fine-tune, see Topic 2: Prompting vs Fine-Tuning.
Python — LLM-Based Classification
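```python
import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def classify_with_llm(text, labels):
    """Classify text into one of the given labels using an LLM."""
    prompt = f"""Classify the following text into exactly one of these labels:
{json.dumps(labels)}

Text: "{text}"

Respond with JSON: {{"label": "...", "reason": "..."}}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # Reduces run-to-run variation; does not guarantee determinism
        response_format={"type": "json_object"},  # Ask the API for a JSON object
    )
    result = json.loads(response.choices[0].message.content)
    # Reject anything outside the label set instead of passing it downstream
    if result.get("label") not in labels:
        raise ValueError(f"Model returned an unknown label: {result.get('label')!r}")
    return result


# Example usage
result = classify_with_llm(
    "This product is incredible, best purchase all year!",
    ["positive", "negative", "neutral"],
)
print(result)
# Example output: {'label': 'positive', 'reason': 'Strongly positive language...'}
```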
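Because the prompt only asks for a label from the set, the response still needs checking before anything downstream consumes it. The sketch below shows one way to do that, assuming the model returns JSON text; `parse_label` and the `"other"` fallback are illustrative names chosen here, not part of any library.

```python
import json


def parse_label(raw_response, labels, fallback="other"):
    """Parse a JSON model response and return a label from `labels`.

    Returns `fallback` when the response is not valid JSON, is not a JSON
    object, or names a label outside the allowed set.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(payload, dict):
        return fallback
    candidate = str(payload.get("label", "")).strip().lower()
    # Compare case-insensitively, but return the canonical spelling
    canonical = {label.lower(): label for label in labels}
    return canonical.get(candidate, fallback)


labels = ["positive", "negative", "neutral"]
print(parse_label('{"label": "Positive", "reason": "praise"}', labels))  # positive
print(parse_label('{"label": "angry"}', labels))  # other
print(parse_label("not json at all", labels))  # other
```

Mapping invalid output to a fallback label keeps the pipeline running; raising an error instead is the stricter choice when silent fallbacks would hide prompt problems.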
How do you constrain the LLM to only output valid labels?
Can the LLM classify in languages it was not primarily trained on?
What about using the LLM's log probabilities for classification?
Prompting vs Fine-Tuning
Prompting
- ✓ Taxonomy changes frequently
- ✓ Limited labeled data
- ✓ Need explanations with labels
- ✓ Rapid prototyping phase
- ✗ Higher per-prediction cost
- ✗ Output can be inconsistent
Fine-Tuning
- ✓ Stable label set
- ✓ High prediction volume
- ✓ Low latency required
- ✓ Tight consistency needed
- ✗ Requires labeled training data
- ✗ Slower to iterate
The Decision Matrix
The prompting-versus-fine-tuning decision is not about which is "better" — it is about matching the approach to the operational reality:
| Factor | Favor Prompting | Favor Fine-Tuning |
|---|---|---|
| Label stability | Labels change monthly | Labels fixed for 6+ months |
| Data availability | <100 labeled examples | 1,000+ labeled examples |
| Prediction volume | <1,000/day | >10,000/day |
| Latency target | >500ms acceptable | <50ms required |
| Explanation needs | Rationale required | Label only |
| Team ML maturity | Low (prompt-first team) | High (ML ops capability) |
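The matrix can be read as a set of votes. The sketch below is one hypothetical way to tally them: `recommend_approach` and its parameter names are made up for illustration, the thresholds are copied from the table, and "labels change monthly" is reduced to a yes/no flag. Values that fall between the two columns cast no vote.

```python
def recommend_approach(labels_change_monthly, labeled_examples,
                       predictions_per_day, latency_budget_ms,
                       needs_rationale, has_ml_ops):
    """Count how many rows of the decision matrix favor each approach."""
    votes = {"prompting": 0, "fine-tuning": 0}

    votes["prompting" if labels_change_monthly else "fine-tuning"] += 1

    if labeled_examples < 100:
        votes["prompting"] += 1
    elif labeled_examples >= 1000:
        votes["fine-tuning"] += 1

    if predictions_per_day < 1000:
        votes["prompting"] += 1
    elif predictions_per_day > 10000:
        votes["fine-tuning"] += 1

    if latency_budget_ms > 500:
        votes["prompting"] += 1
    elif latency_budget_ms < 50:
        votes["fine-tuning"] += 1

    votes["prompting" if needs_rationale else "fine-tuning"] += 1
    votes["fine-tuning" if has_ml_ops else "prompting"] += 1

    if votes["prompting"] == votes["fine-tuning"]:
        return "either", votes
    return max(votes, key=votes.get), votes


choice, votes = recommend_approach(
    labels_change_monthly=True, labeled_examples=50,
    predictions_per_day=500, latency_budget_ms=1000,
    needs_rationale=True, has_ml_ops=False,
)
print(choice, votes)
```

An unweighted vote is a simplification: in practice a hard latency requirement or a missing training set can decide the question on its own.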
The Lifecycle Pattern
Many successful teams follow a lifecycle: start with prompting to validate the task definition and build an evaluation set, then migrate to fine-tuning once the taxonomy stabilizes and volume justifies the investment. The prompting phase generates labeled data that feeds the fine-tuning phase — a natural flywheel. See Topic 3: Zero-Shot vs Few-Shot for the intermediate step.
Python — Prompt-Based vs Fine-Tuned Classifier
```python
from transformers import pipeline


# === Approach 1: Prompt-based classification ===
# Fast to set up, no training data needed
def prompt_classify(text, labels, generate):
    """Classify with a prompt.

    `generate` is a placeholder for your LLM client: any callable that takes
    a prompt string and returns the model's text completion.
    """
    prompt = f'Classify into one of {labels}: "{text}"\nLabel:'
    return generate(prompt).strip()


# === Approach 2: Fine-tuned classifier ===
# Requires labeled data, but faster and cheaper at scale
# After fine-tuning on your labeled dataset (the model path is a placeholder):
classifier = pipeline(
    "text-classification",
    model="./my-fine-tuned-classifier",
)

# Inference takes a few milliseconds for a small model on suitable hardware
# and is repeatable for a fixed model; verify calibration on held-out data
result = classifier("The delivery was two weeks late.")
print(result)
# Example output: [{'label': 'complaint', 'score': 0.94}]

# Illustrative cost comparison at 100K predictions/day (actual figures depend
# on the model, token counts, and hardware):
#   Prompting:  ~$50-200/day (API costs)
#   Fine-tuned: ~$5-10/day (self-hosted GPU)
```
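The cost comparison can be turned into a break-even estimate. The sketch below assumes a simple model: prompting costs a fixed amount per 1,000 calls, and a self-hosted classifier costs a flat amount per day regardless of volume. The prices are illustrative values inside the ranges quoted above, and the function names are made up for this example.

```python
def daily_cost_prompting(predictions_per_day, cost_per_1k_calls):
    """API cost grows linearly with volume."""
    return predictions_per_day / 1000 * cost_per_1k_calls


def break_even_volume(cost_per_1k_calls, hosting_cost_per_day):
    """Daily volume above which the flat-cost self-hosted model is cheaper."""
    return hosting_cost_per_day / cost_per_1k_calls * 1000


# Assumed prices: $1.00 per 1,000 API calls, $8.00/day for a self-hosted GPU
print(daily_cost_prompting(100_000, 1.0))  # 100.0 dollars/day at 100K predictions
print(break_even_volume(1.0, 8.0))         # 8000.0 predictions/day
```

This ignores the one-time cost of labeling data and training, which pushes the real break-even point higher.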
Can you fine-tune the LLM itself instead of training a separate classifier?
What if I need explanations but also want low latency?
How much labeled data do you need for fine-tuning to beat prompting?
Zero-Shot vs Few-Shot Classification
When Zero-Shot Suffices
Zero-shot classification works when labels are self-explanatory (like "positive" / "negative"), when the model's pretraining data covers the domain well, and when the task is common enough that the model has seen similar tasks during training. It is the fastest option to set up and requires no example curation.
When Few-Shot Helps
Few-shot examples are valuable when:
- Labels are organization-specific — "P1" vs "P2" escalation levels that the model has never seen.
- Boundaries are subtle — When "billing" and "account" overlap, examples show where you draw the line.
- Format matters — Examples demonstrate the exact output structure you expect.
- Edge cases are common — Strategic examples can bias the model toward correct handling of tricky cases.
GPT-3 (Brown et al., 2020) popularized in-context learning, where performance can improve substantially with well-chosen examples. The quality and diversity of examples matter more than quantity: 3-5 carefully chosen examples often outperform 20 random ones.
Example Selection Strategy
Choose few-shot examples that cover each class, include at least one edge case, and represent the actual distribution of inputs. Avoid examples that are too easy or too similar to each other. See Topic 4: Taxonomy Design for how label definitions interact with example selection.
Python — Zero-Shot vs Few-Shot Prompting
```python
# === Zero-shot: instructions only ===
zero_shot_prompt = """Classify the customer message as: billing, technical, or general.

Message: "My card was charged twice for the same order."
Label:"""

# === Few-shot: instructions + examples ===
few_shot_prompt = """Classify customer messages. Examples:

"I can't log into my account" -> technical
"Please update my billing address" -> billing
"What are your business hours?" -> general

Message: "My card was charged twice for the same order."
Label:"""

# Few-shot examples tend to help most on ambiguous or organization-specific
# labels; measure the gain on your own evaluation set rather than assuming it

# Pro tip: use diverse, representative examples
# that cover edge cases, not just easy examples
def select_examples(examples, n_per_class=2):
    """Select examples covering each class."""
    selected = []
    # Sort labels so the selection order is stable across runs
    for label in sorted({e["label"] for e in examples}):
        class_examples = [e for e in examples if e["label"] == label]
        # Placeholder: takes the first n; pick diverse ones instead
        # (e.g., by embedding distance)
        selected.extend(class_examples[:n_per_class])
    return selected
```
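The "pick diverse examples" step can be sketched without an embedding model. The version below swaps in Jaccard distance over word sets as a cheap stand-in for embedding distance and uses greedy farthest-point selection; both function names are illustrative, and a real system would more likely compare embeddings.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def select_diverse(texts, k):
    """Greedy farthest-point selection.

    Start with the first text, then repeatedly add the text whose nearest
    already-selected neighbor is farthest away.
    """
    if not texts or k <= 0:
        return []
    selected = [texts[0]]
    remaining = list(texts[1:])
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda t: min(jaccard_distance(t, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


billing_examples = [
    "I was charged twice",
    "I was charged twice today",
    "Refund my last invoice please",
]
print(select_diverse(billing_examples, k=2))
# ['I was charged twice', 'Refund my last invoice please']
```

Run this per class so each label contributes examples that differ from one another rather than near-duplicates.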
Does the order of few-shot examples matter?
How many few-shot examples are enough?
Can few-shot examples hurt performance?
Taxonomy Design
Principles of Good Taxonomy
- Mutual exclusivity — Each input should clearly belong to one class. If "billing complaint" and "service complaint" frequently overlap, merge or restructure them.
- Exhaustive coverage — Every expected input should map to at least one label. Include an "other" category for genuinely novel cases.
- Operational utility — Labels should map to different actions. If two labels trigger the same workflow, they should probably be merged.
- Clear definitions — Each label needs a description, inclusion criteria, exclusion criteria, and 2-3 examples.
The Human Agreement Test
Before building any classifier, have 3-5 humans independently label a sample of 100+ items. Measure inter-annotator agreement with a chance-corrected statistic (e.g., Cohen's kappa for each pair of annotators, or Fleiss' kappa across all of them). If humans agree less than about 80% of the time, or kappa is correspondingly low, the taxonomy is the problem, not the model. Fix the definitions before investing in model training.
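The agreement check can be computed in a few lines. The sketch below implements Cohen's kappa directly and averages it over every pair of annotators, which is one common way to handle more than two raters; `mean_pairwise_kappa` is an illustrative helper, and the sample labels are invented.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


annotator_1 = ["billing", "billing", "technical", "technical"]
annotator_2 = ["billing", "technical", "technical", "technical"]
print(f"{cohen_kappa(annotator_1, annotator_2):.2f}")  # 0.50
```

Here the annotators agree on 75% of items, but half of that agreement is expected by chance, so kappa is only 0.50. This is why raw agreement and kappa should not be read on the same scale.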
Taxonomy Evolution
Taxonomies evolve. A well-designed system anticipates this by:
| Strategy | Benefit |
|---|---|
| Version your taxonomy | Track which labels existed at which time |
| Log raw predictions + inputs | Enables re-labeling when taxonomy changes |
| Include "other/unknown" | Catches inputs that don't fit current labels |
| Monitor confusion pairs | Identifies labels that need clearer boundaries |
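The first two strategies in the table can be combined in the prediction log. The sketch below is a minimal illustration: the version identifiers, label sets, and function names are all hypothetical, and a production system would persist the records rather than keep them in memory.

```python
# Hypothetical taxonomy versions: the label set that was valid at each release
TAXONOMY_VERSIONS = {
    "v1": {"billing", "technical", "other"},
    "v2": {"billing", "technical", "account", "other"},
}


def make_prediction_record(text, label, version):
    """Build a log record that stores the raw input and the taxonomy version."""
    if label not in TAXONOMY_VERSIONS[version]:
        raise ValueError(f"{label!r} is not a label in taxonomy {version}")
    return {"text": text, "label": label, "taxonomy_version": version}


def stale_records(records, current_version):
    """Records labeled under an older taxonomy, candidates for re-labeling."""
    return [r for r in records if r["taxonomy_version"] != current_version]


log = [
    make_prediction_record("I can't reset my password", "technical", "v1"),
    make_prediction_record("Please close my account", "account", "v2"),
]
print(len(stale_records(log, current_version="v2")))  # 1
```

Because the raw text is stored alongside the version, the stale record can be re-classified against the new label set without recovering the original input from elsewhere.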
This connects directly to Topic 5: Class Imbalance — poorly designed taxonomies often create artificial imbalance by merging common cases into one large class.
Python — Taxonomy Definition & Validation
```python
# A well-structured taxonomy definition for an LLM classifier
taxonomy = {
    "billing": {
        "description": "Issues related to charges, invoices, refunds, or payment methods",
        "includes": ["double charges", "refund requests", "invoice errors"],
        "excludes": ["account access issues", "feature requests"],
        "examples": [
            "I was charged twice for my subscription",
            "Can I get a refund for last month?",
        ],
    },
    "technical": {
        "description": "Issues with product functionality, bugs, or access",
        "includes": ["login failures", "app crashes", "feature not working"],
        "excludes": ["billing questions", "general inquiries"],
        "examples": [
            "The app crashes when I try to upload a file",
            "I can't reset my password",
        ],
    },
}


# Validate taxonomy quality: check for identical entries in two "includes" lists
def check_overlap(taxonomy):
    for l1, d1 in taxonomy.items():
        for l2, d2 in taxonomy.items():
            if l1 >= l2:
                continue  # Visit each unordered pair once
            overlap = set(d1["includes"]) & set(d2["includes"])
            if overlap:
                print(f"WARNING: {l1} and {l2} overlap on: {overlap}")
```
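A structured taxonomy like this can be rendered straight into the classification prompt, so that label definitions live in one place. The sketch below is one possible layout; `render_taxonomy_prompt` and the exact wording of the prompt are assumptions, and the single-label taxonomy is trimmed for brevity.

```python
def render_taxonomy_prompt(taxonomy, text):
    """Turn a taxonomy dict into a prompt listing each label's definition."""
    lines = ["Classify the message into exactly one label.", ""]
    for label, spec in taxonomy.items():
        lines.append(f"Label: {label}")
        lines.append(f"  Description: {spec['description']}")
        lines.append(f"  Includes: {', '.join(spec['includes'])}")
        lines.append(f"  Excludes: {', '.join(spec['excludes'])}")
        for example in spec["examples"]:
            lines.append(f'  Example: "{example}"')
        lines.append("")
    lines.append(f'Message: "{text}"')
    lines.append("Label:")
    return "\n".join(lines)


sample_taxonomy = {
    "billing": {
        "description": "Issues related to charges, invoices, refunds, or payment methods",
        "includes": ["double charges", "refund requests"],
        "excludes": ["account access issues"],
        "examples": ["I was charged twice for my subscription"],
    },
}
print(render_taxonomy_prompt(sample_taxonomy, "Where is my refund?"))
```

Generating the prompt from the taxonomy means a definition change is made once and reaches both the prompt and any validation code that reads the same dict.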
How many classes can an LLM reliably handle?
Should the taxonomy be flat or hierarchical?
What happens when the taxonomy changes in production?
Class Imbalance
Why Imbalance Matters
In imbalanced datasets, a model can achieve high overall accuracy by simply predicting the majority class for every input. If 98% of transactions are legitimate, a model that always says "not fraud" is 98% accurate — but completely useless for its actual purpose.
For LLM-based classifiers, imbalance manifests differently. The model is not trained on your specific distribution, but its predictions can still be biased by the number of examples per class in the prompt or by the base rates in its pretraining data.
Mitigation Strategies
| Strategy | Approach | When to Use |
|---|---|---|
| Prompt engineering | Explicitly describe minority classes with more detail and examples | Prompt-based systems |
| Balanced evaluation | Use per-class metrics (precision, recall, F1) rather than overall accuracy | Always |
| Cost-sensitive review | Route minority-class predictions to human review more aggressively | High-stakes domains |
| Reweighted training | Oversample minority classes or weight their loss higher during fine-tuning | Fine-tuned models |
| Threshold tuning | Adjust classification thresholds per class based on business cost | Production deployment |
Imbalance is both a data problem and a decision-policy problem. You may care more about minority recall in fraud, safety, or medical triage than about raw overall accuracy. The right evaluation metric should reflect that priority. See Topic 7: Metrics That Matter.
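Threshold tuning by business cost can be sketched as a small search. The example below assumes the classifier produces a fraud score per transaction and that each false negative and false positive has a known cost; the scores, costs, and function names are invented for illustration.

```python
def total_cost(scores, is_fraud, threshold, cost_fn, cost_fp):
    """Cost of flagging every transaction whose score is >= threshold."""
    cost = 0.0
    for score, fraud in zip(scores, is_fraud):
        flagged = score >= threshold
        if fraud and not flagged:
            cost += cost_fn  # Missed fraud
        elif flagged and not fraud:
            cost += cost_fp  # Legitimate transaction sent to review
    return cost


def best_threshold(scores, is_fraud, cost_fn, cost_fp, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, is_fraud, t, cost_fn, cost_fp),
    )


scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
is_fraud = [False, False, True, False, True, True]

# Assumed costs: a missed fraud case is 50x worse than a needless review
print(best_threshold(scores, is_fraud, cost_fn=50, cost_fp=1,
                     candidates=[0.3, 0.5, 0.7]))  # 0.3
```

With these costs the lowest threshold wins: it sends one legitimate transaction to review (cost 1) instead of missing a fraud case (cost 50). Tune the threshold on a held-out labeled set, not on the data used for final evaluation.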
Python — Handling Class Imbalance
```python
from sklearn.metrics import classification_report

# Simulated imbalanced predictions
y_true = ["normal"] * 980 + ["fraud"] * 20
y_pred = ["normal"] * 990 + ["fraud"] * 10  # Misses 50% of fraud

# Overall accuracy looks great but is misleading
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Overall accuracy: {accuracy:.1%}")  # 99.0% - deceptive!

# Per-class metrics reveal the real problem
print(classification_report(y_true, y_pred))
# fraud: recall = 0.50 (missing half of all fraud cases!)

# === For prompted LLMs: emphasize minority class ===
imbalance_aware_prompt = """Classify transactions. Pay special attention to
fraud indicators - it is critical not to miss fraud cases.

Fraud signals: unusual amounts, new locations, rapid succession,
mismatched billing details.

IMPORTANT: When uncertain between fraud and normal, flag as fraud
for human review. False negatives are much costlier than false positives."""
```
Does few-shot example balance affect LLM predictions?
How do you evaluate on imbalanced data?
Can synthetic data help with minority classes?
Multi-label challenges, evaluation metrics, confidence estimation, human review, and the failure modes that break classification systems in production.
Multi-Label Classification
How Multi-Label Differs
The structural differences between single-label and multi-label classification are significant:
| Aspect | Single-Label | Multi-Label |
|---|---|---|
| Decision rule | Pick the highest-scoring class | Each class has an independent threshold |
| Output format | One label string | List/set of label strings |
| Evaluation | Accuracy, macro-F1 | Subset accuracy, Hamming loss, per-label F1 |
| Common error | Picking wrong class | Under-tagging or over-tagging |
| Prompt design | "Choose ONE label" | "List ALL applicable labels" |
Challenges with LLMs
LLMs tend to either under-tag (missing applicable labels) or over-tag (applying labels that are only marginally relevant). Specific challenges include:
- Calibration — How confident must the model be to include a label? Each label needs its own inclusion threshold.
- Order effects — The model may apply the first few relevant labels and stop, missing later ones.
- Validation — Checking that a list of labels is valid is harder than checking a single label.
For related guidance on evaluation metrics for multi-label settings, see Topic 7: Metrics That Matter.
Python — Multi-Label LLM Classification
```python
import json

import numpy as np
from sklearn.metrics import f1_score, hamming_loss


def multi_label_classify(text, labels, generate):
    """Classify text with ALL applicable labels (multi-label).

    `generate` is a placeholder for your LLM client: any callable that takes
    a prompt string and returns the model's text response.
    """
    prompt = f"""Analyze the following text and select ALL labels that apply.
Only include a label if it is clearly relevant.

Available labels: {json.dumps(labels)}

Text: "{text}"

Respond with JSON: {{"labels": ["label1", "label2", ...], "reasons": {{"label1": "...", ...}}}}"""
    return json.loads(generate(prompt))


# === Multi-label evaluation ===
# Binary vectors: [billing, technical, account, general]
y_true = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]])
y_pred = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0]])

print(f"Hamming Loss: {hamming_loss(y_true, y_pred):.3f}")
# zero_division=0 scores labels with no predicted positives (here: general) as 0
print(f"Micro F1: {f1_score(y_true, y_pred, average='micro', zero_division=0):.3f}")
print(f"Macro F1: {f1_score(y_true, y_pred, average='macro', zero_division=0):.3f}")
```
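Validating a list of labels takes a little more care than validating one. The sketch below parses a JSON response, drops unknown and duplicate labels, and converts the result into the binary vector format used for evaluation; the function names are illustrative, and it assumes the response follows the `{"labels": [...]}` shape requested in the prompt.

```python
import json


def parse_multi_label(raw_response, labels):
    """Return the valid, de-duplicated labels from a JSON response.

    Labels come back in taxonomy order; unknown labels are dropped.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    predicted = payload.get("labels", [])
    if not isinstance(predicted, list):
        return []
    predicted_set = {p for p in predicted if isinstance(p, str)}
    return [label for label in labels if label in predicted_set]


def to_binary_vector(selected, labels):
    """Encode the selected labels as a 0/1 vector aligned with `labels`."""
    return [1 if label in selected else 0 for label in labels]


labels = ["billing", "technical", "account", "general"]
raw = '{"labels": ["technical", "billing", "billing", "shipping"]}'
selected = parse_multi_label(raw, labels)
print(selected)                            # ['billing', 'technical']
print(to_binary_vector(selected, labels))  # [1, 1, 0, 0]
```

Counting how often unknown labels such as "shipping" are dropped is itself a useful signal: a rising count may mean the taxonomy is missing a category.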
How do you set thresholds for each label independently?
Should you use separate prompts for each label or one combined prompt?
What is Hamming loss and why use it for multi-label?
Metrics That Matter
Choosing the Right Metric
The metric you optimize should reflect the business cost of being wrong:
| Scenario | Optimize | Why |
|---|---|---|
| Fraud detection | Recall | Missing fraud is catastrophic |
| Content moderation | Precision | Over-censoring drives users away |
| Medical triage | Recall (sensitivity) | Missing a critical case is unacceptable |
| Lead qualification | Precision | Sales time wasted on bad leads is expensive |
| General classification | Macro-F1 | Balanced performance across all classes |
Beyond Standard Metrics
Senior candidates distinguish themselves by mentioning operational metrics:
- Abstention rate — How often does the system say "I don't know" and escalate?
- Reviewer overturn rate — How often do humans disagree with the model's label?
- Calibration — When the model says 90% confident, is it right 90% of the time? See Topic 8: Confidence Estimation.
- Confusion matrix patterns — Which specific class pairs are most frequently confused?
These operational metrics connect to Topic 9: Human in the Loop and Topic 10: Production Failure Modes.
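Two of these operational metrics can be computed directly from a prediction log. The sketch below assumes each record holds a `model_label` (with `None` meaning the system abstained) and a `human_label` (with `None` meaning the case was not reviewed); that record shape is an assumption made for this example.

```python
def operational_metrics(records):
    """Compute abstention rate and reviewer overturn rate from log records."""
    total = len(records)
    abstained = sum(r["model_label"] is None for r in records)
    # Overturn rate only counts cases with both a model label and a review
    reviewed = [
        r for r in records
        if r["model_label"] is not None and r.get("human_label") is not None
    ]
    overturned = sum(r["model_label"] != r["human_label"] for r in reviewed)
    return {
        "abstention_rate": abstained / total if total else 0.0,
        "overturn_rate": overturned / len(reviewed) if reviewed else 0.0,
    }


log = [
    {"model_label": "billing", "human_label": "billing"},
    {"model_label": "billing", "human_label": "technical"},
    {"model_label": None, "human_label": "general"},  # abstained
    {"model_label": "general", "human_label": None},  # never reviewed
]
print(operational_metrics(log))
# {'abstention_rate': 0.25, 'overturn_rate': 0.5}
```

Note that the overturn rate is measured only on reviewed cases, which are usually the hard ones, so it overstates the error rate on the full stream.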
Python — Classification Metrics
```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Ground truth and model predictions
y_true = ["billing"] * 30 + ["technical"] * 50 + ["general"] * 20
y_pred = (
    ["billing"] * 25 + ["technical"] * 5
    + ["technical"] * 45 + ["billing"] * 5
    + ["general"] * 18 + ["technical"] * 2
)

# Full classification report with per-class metrics
print(classification_report(y_true, y_pred))

# Macro-F1 treats all classes equally (important for imbalanced data)
macro = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1: {macro:.3f}")

# Confusion matrix reveals which classes are confused
cm = confusion_matrix(y_true, y_pred, labels=["billing", "technical", "general"])
print("Confusion Matrix:")
print(cm)
# Rows = true, Cols = predicted
# Off-diagonal values show misclassification patterns
```
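The off-diagonal cells can also be ranked directly, which is easier to scan than a matrix once the label set grows. A minimal sketch, using the same predictions as the example above and an illustrative helper name:

```python
from collections import Counter


def top_confusion_pairs(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) misclassification pairs."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)


y_true = ["billing"] * 30 + ["technical"] * 50 + ["general"] * 20
y_pred = (
    ["billing"] * 25 + ["technical"] * 5
    + ["technical"] * 45 + ["billing"] * 5
    + ["general"] * 18 + ["technical"] * 2
)

for (true_label, predicted_label), count in top_confusion_pairs(y_true, y_pred):
    print(f"{true_label} -> {predicted_label}: {count}")
# billing -> technical: 5
# technical -> billing: 5
# general -> technical: 2
```

Here billing and technical are confused in both directions, which points at the boundary between those two labels rather than at a one-sided bias.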
When should you use micro-F1 vs macro-F1?
How do you interpret a confusion matrix for LLM classifiers?
What is calibration and why does it matter for classification?
Confidence Estimation
Why Verbal Confidence Fails
When asked "how confident are you?", LLMs tend to be overconfident. They produce high verbal confidence even for wrong answers, because they are trained to generate plausible-sounding text, not calibrated probabilities. A model might say "95% confident" for an answer it gets wrong 40% of the time.
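The gap between stated confidence and actual accuracy can be measured with expected calibration error (ECE): bucket predictions by confidence, then take the size-weighted average of the gap between mean confidence and accuracy in each bucket. The sketch below is a plain-Python version with equal-width bins, applied to the scenario just described.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |mean confidence - accuracy| over bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("Need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# The model claims 95% confidence on ten answers but gets four of them wrong
confidences = [0.95] * 10
correct = [True] * 6 + [False] * 4
print(f"{expected_calibration_error(confidences, correct):.2f}")  # 0.35
```

A perfectly calibrated model scores 0; here the 0.35 gap is exactly the difference between the claimed 95% and the observed 60% accuracy.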
Better Approaches
| Method | How It Works | Trade-Off |
|---|---|---|
| Token logprobs | Extract the log probability the model assigns to the label token | Best signal, but not all APIs expose it |
| Self-consistency | Sample N responses and measure label agreement | Good signal, but costs N API calls per prediction |
| Prompt variants | Rephrase the prompt 3-5 ways and check agreement | Catches prompt sensitivity, moderately expensive |
| Calibration mapping | Use a held-out labeled set to build a mapping from raw scores to true probabilities | Excellent accuracy, requires labeled calibration data |
| Ensemble agreement | Run multiple models and check agreement | Most robust, most expensive |
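For the token logprobs row, a common post-processing step is to renormalize the log probabilities of the candidate labels so they sum to 1. The sketch below assumes you have already obtained one log probability per label from your provider (how to request them differs by API and is not shown), and that each label corresponds to a single token; multi-token labels need extra handling.

```python
import math


def label_probabilities(label_logprobs):
    """Renormalize per-label log probabilities into a distribution."""
    # Subtract the max before exponentiating for numerical stability
    max_lp = max(label_logprobs.values())
    exp = {label: math.exp(lp - max_lp) for label, lp in label_logprobs.items()}
    total = sum(exp.values())
    return {label: value / total for label, value in exp.items()}


# Illustrative values: the model put 0.6, 0.2, and 0.1 on the three labels,
# with the remaining 0.1 of probability mass on tokens outside the label set
logprobs = {
    "billing": math.log(0.6),
    "technical": math.log(0.2),
    "general": math.log(0.1),
}
probs = label_probabilities(logprobs)
print({label: round(p, 3) for label, p in probs.items()})
# {'billing': 0.667, 'technical': 0.222, 'general': 0.111}
```

Renormalized scores are still raw model scores; mapping them to true probabilities requires the calibration step listed in the table.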
Production Pattern
Production systems typically combine multiple signals: model scores, retrieval evidence, schema validity, and historical error patterns. The combined signal feeds a routing decision: high-confidence cases are auto-routed, low-confidence cases go to human review. See Topic 9: Human in the Loop.
Python — Self-Consistency Confidence
```python
from collections import Counter


def self_consistency_classify(text, labels, classify_fn, n_samples=5):
    """Run classification N times and use agreement as confidence.

    `classify_fn(text, labels)` is a placeholder for your LLM call. It should
    sample with temperature > 0 (e.g., 0.7) so that repeated calls can differ.
    """
    predictions = [classify_fn(text, labels) for _ in range(n_samples)]

    # Count label frequencies across samples
    counts = Counter(predictions)
    # Ties go to the label that appeared first
    top_label, top_count = counts.most_common(1)[0]

    # Agreement rate = confidence estimate
    confidence = top_count / n_samples

    return {
        "label": top_label,
        "confidence": confidence,
        "distribution": dict(counts),
        "should_escalate": confidence < 0.6,  # Low agreement = uncertain
    }


# Example: 5 runs produce ["billing","billing","billing","technical","billing"]
# → label="billing", confidence=0.8, should_escalate=False
# Example: 5 runs produce ["billing","technical","billing","general","technical"]
# → label="billing", confidence=0.4, should_escalate=True
```
How do you calibrate an LLM classifier?
Is temperature 0 always best for classification?
What threshold should trigger human review?
Human in the Loop
When to Escalate
Route to human review when:
- Low model confidence — The classifier is uncertain (see Topic 8: Confidence Estimation).
- High-impact decision — Account suspension, medical triage, legal compliance.
- Novel input — The input looks different from what the model was trained or prompted on.
- Conflicting signals — Multiple models or prompt variants disagree.
- Frequently confused classes — The specific class pair has a high historical confusion rate.
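These triggers can be collected into a single routing check that also records why a case was escalated. The sketch below is illustrative: the label names, the confused pair, the 0.8 threshold, and the function signature are all assumptions, and how novelty or variant disagreement is detected is left to the caller.

```python
# Hypothetical configuration for this example
HIGH_IMPACT_LABELS = {"account_suspension", "medical_triage", "legal_compliance"}
CONFUSED_PAIRS = {frozenset({"billing", "account"})}


def escalation_reasons(label, confidence, runner_up=None,
                       variants_agree=True, is_novel=False,
                       min_confidence=0.8):
    """Return the list of reasons to escalate; an empty list means auto-route."""
    reasons = []
    if confidence < min_confidence:
        reasons.append("low_confidence")
    if label in HIGH_IMPACT_LABELS:
        reasons.append("high_impact")
    if is_novel:
        reasons.append("novel_input")
    if not variants_agree:
        reasons.append("conflicting_signals")
    if runner_up is not None and frozenset({label, runner_up}) in CONFUSED_PAIRS:
        reasons.append("confused_pair")
    return reasons


print(escalation_reasons("general", 0.95))                       # []
print(escalation_reasons("billing", 0.9, runner_up="account"))   # ['confused_pair']
print(escalation_reasons("account_suspension", 0.5, variants_agree=False))
# ['low_confidence', 'high_impact', 'conflicting_signals']
```

Returning reasons instead of a boolean lets reviewers see why a case reached them and lets you track which trigger drives most of the review load.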
Human Review as a Data Engine
The most valuable aspect of human review is not the individual decision — it is the data it generates. Human decisions on escalated cases are the highest-quality training and evaluation data available because they represent the hardest, most informative examples. Feed this data back into:
- Evaluation sets — Hard cases make evaluation more realistic.
- Few-shot examples — Resolved edge cases become great prompt examples.
- Fine-tuning data — Labeled hard cases improve model weak spots.
- Taxonomy refinement — Patterns in escalations reveal where labels need clearer boundaries (see Topic 4: Taxonomy Design).
Python — Confidence-Based Routing
```python
def classify_and_route(text, classify_fn, confidence_threshold=0.8):
    """Classify with automatic routing based on confidence.

    `classify_fn(text)` is a placeholder for your classifier. It should return
    a dict with "label" and "confidence" keys and, optionally, "reason".
    """
    # Step 1: Get classification with confidence
    result = classify_fn(text)

    # Step 2: Route based on confidence
    if result["confidence"] >= confidence_threshold:
        # High confidence: auto-route
        return {
            "action": "auto_route",
            "label": result["label"],
            "confidence": result["confidence"],
        }
    # Low confidence: escalate to human reviewer
    return {
        "action": "human_review",
        "suggested_label": result["label"],
        "confidence": result["confidence"],
        "reason": result.get("reason", "Low confidence"),
        "priority": "high" if result["confidence"] < 0.5 else "normal",
    }


# Step 3: Log human decisions for model improvement
def log_review_decision(text, model_label, human_label):
    """Build a record of a human review outcome for retraining."""
    record = {
        "text": text,
        "model_label": model_label,
        "human_label": human_label,
        "overturned": model_label != human_label,
    }
    # Append the record to your training/evaluation dataset here;
    # these hard cases are the most valuable training data
    return record
```
How do you size the human review team?
Should reviewers see the model's prediction?
How do you handle reviewer disagreements?
Production Failure Modes
The System-Level View
The strongest interview answer treats classification quality as a system property, not a model property. Quality depends on the entire pipeline:
- Data definitions — Ambiguous taxonomies produce ambiguous predictions (see Topic 4: Taxonomy Design).
- Prompt stability — If minor prompt tweaks change 10% of predictions, the system is fragile.
- Evaluation coverage — If your test set does not include edge cases and minority classes, you will not detect problems (see Topic 7: Metrics That Matter).
- Routing policies — The confidence threshold determines the auto/escalation split (see Topic 9: Human in the Loop).
- Monitoring — Tracking only aggregate accuracy will miss class-specific degradation.
Common Failure Catalog
| Failure | Symptom | Detection |
|---|---|---|
| Label drift | Old labels appear in output after taxonomy update | Monitor label distribution over time |
| Prompt brittleness | Accuracy drops after minor prompt edit | A/B test prompt changes on evaluation set |
| Format errors | Downstream systems crash on unexpected output | Schema validation on every prediction |
| Minority neglect | Per-class recall near zero for rare classes | Per-class metrics, not just overall accuracy |
| Silent degradation | Accuracy drops without obvious cause | Automated regression testing on evaluation set |
| Upstream shift | Input preprocessing changes alter effective input | Monitor input feature distributions |
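Prompt brittleness can be quantified before a prompt edit ships by running both versions over the same evaluation set and counting how many predictions change. A minimal sketch, with invented predictions and an illustrative function name:

```python
def flip_rate(predictions_old, predictions_new):
    """Fraction of evaluation items whose predicted label changed."""
    if len(predictions_old) != len(predictions_new) or not predictions_old:
        raise ValueError("Both prompts must be run on the same non-empty set")
    flips = sum(old != new for old, new in zip(predictions_old, predictions_new))
    return flips / len(predictions_old)


old = ["billing", "technical", "general", "billing", "technical",
       "general", "billing", "technical", "general", "billing"]
new = ["billing", "technical", "general", "billing", "general",
       "general", "billing", "technical", "general", "billing"]

print(flip_rate(old, new))  # 0.1
```

Flip rate complements accuracy: two prompts can score the same accuracy while disagreeing on many individual items, which still matters to downstream consumers.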
Monitoring Best Practices
Production classification systems should monitor:
- Prediction distribution — Alert if the fraction of any class shifts by more than a threshold.
- Confidence distribution — Alert if average confidence drops (may indicate out-of-distribution inputs).
- Escalation rate — Alert if the fraction of human-reviewed cases spikes.
- Per-class metrics — Track precision and recall per class, not just overall accuracy.
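For the prediction-distribution check, a single summary number is often easier to alert on than one threshold per class. The sketch below uses total variation distance (half the sum of absolute differences between two distributions); this is one reasonable choice among several, and the expected distribution and sample predictions are invented.

```python
from collections import Counter


def label_distribution(labels):
    """Fraction of predictions per label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(expected, observed):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    all_labels = set(expected) | set(observed)
    return 0.5 * sum(
        abs(expected.get(label, 0.0) - observed.get(label, 0.0))
        for label in all_labels
    )


expected = {"billing": 0.3, "technical": 0.5, "general": 0.2}
recent = ["billing"] * 50 + ["technical"] * 40 + ["general"] * 10

distance = total_variation_distance(expected, label_distribution(recent))
print(f"{distance:.2f}")  # 0.20
```

Because the union of labels is used, a label that appears in production but not in the expected distribution contributes to the distance instead of being ignored.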
Python — Classification Monitoring Pipeline
```python
from collections import Counter
from datetime import datetime, timezone


class ClassificationMonitor:
    """Monitor classification system health in production."""

    def __init__(self, expected_distribution, drift_threshold=0.1):
        self.expected = expected_distribution
        self.threshold = drift_threshold
        self.predictions = []

    def log_prediction(self, label, confidence, text_hash):
        """Log each prediction for monitoring."""
        self.predictions.append({
            "label": label,
            "confidence": confidence,
            "timestamp": datetime.now(timezone.utc),
            "text_hash": text_hash,
        })

    def check_distribution_drift(self):
        """Alert if label distribution has shifted significantly."""
        counts = Counter(p["label"] for p in self.predictions)
        total = sum(counts.values())
        if total == 0:
            return []  # Nothing logged yet, so there is nothing to compare

        alerts = []
        for label, expected_frac in self.expected.items():
            actual_frac = counts.get(label, 0) / total
            drift = abs(actual_frac - expected_frac)
            if drift > self.threshold:
                alerts.append(
                    f"DRIFT: {label} expected {expected_frac:.0%}, got {actual_frac:.0%}"
                )
        return alerts

    def check_confidence_drop(self, min_avg=0.7):
        """Alert if average confidence drops below threshold."""
        if not self.predictions:
            return []
        avg_conf = sum(p["confidence"] for p in self.predictions) / len(self.predictions)
        if avg_conf < min_avg:
            return [f"LOW CONFIDENCE: avg={avg_conf:.2f} (threshold={min_avg})"]
        return []
```